# Image Processing Front End for Associative Memory-Based Systems - Hardware-Efficient Low-Power Motion-Picture Segmentation by Pipeline Processing of Tiled Images with Cell-Network -

Representative: Tetsushi Koide (Associate Prof., Research Center for Nanodevices and Systems, Graduate School of Advanced Sciences of Matter)

Cooperators: Hans Juergen Mattausch (Prof., Research Center for Nanodevices and Systems, Graduate School of Advanced Sciences of Matter), Takashi Morimoto (Graduate School of Advanced Sciences of Matter, D1), Yohmei Harada (Graduate School of Advanced Sciences of Matter, M2), Hidekazu Adachi (Faculty of Engineering), and Osamu Kiriyama (RCNS Researcher)

# 1. Research Target

Vision-based intelligent processing for associative memory-based systems with recognition and learning capability, as illustrated in Fig. 1, requires segmentation of objects and feature extraction/modeling. Image segmentation is the process of extracting all objects from natural input images and is the necessary first step of object-oriented image processing such as object recognition or object tracking. For these front-end technologies of image segmentation and feature extraction, we are investigating and implementing new feature extraction and modeling methodologies suitable for associative memory-based systems. Fig. 2 shows a conceptual example of the targeted image segmentation and feature extraction system.

Technologies for image recognition and object tracking are essential in research fields like robot vision and intelligent transport systems. If humanlike recognition of objects in natural complex images becomes possible, a wide range of applications in the industrial, public and private sectors will profit. One important application field is hand-held mobile systems. To realize such systems, 3 requirements have to be met simultaneously, namely real-time processing, compact implementation, and low power dissipation. Since visual information is generally complex and contains a large amount of information, it is difficult to achieve these requirements with general-purpose hardware like FPGAs, microprocessors or digital signal processors (DSPs). A common strategy for reducing the complexity of the task is the extraction of important information (objects) from the natural complex image. To realize this strategy, an image-processing function called *image segmentation* is required.



Figure 1: Structure of envisaged associative memory-based systems.

Many image segmentation algorithms have already been proposed [1, 2]. However, nearly all of these algorithms are strongly software-oriented, so they cannot satisfy the above 3 requirements simultaneously. In particular, hardware implementation is basically restricted to microprocessors or DSPs because of the algorithm complexity. Recently, some proposals for special-purpose image-segmentation hardware have been made [3, 4]. However, they still have problems with respect to power dissipation, chip size, and also segmentation quality when applied to real-time processing of standard-size (e.g. VGA) motion pictures.

The aim of our project in the COE research is to develop a high-speed and high-density image segmentation algorithm and architecture, as well as an associative memory-based feature extraction architecture that uses the proposed image segmentation framework for real-time moving pictures, to enable vision-based intelligent processing.

# 2. Research Results

In this report, we explain the current status of the proposed cell-network-based segmentation architecture, applying a tiled subdivided-image approach (SIA) and a boundary-active-only (BAO) region-growing scheme for low-power mobile applications. A test-chip design of the cell-network core in 0.35um CMOS verifies that real-time VGA-size motion-picture segmentation becomes possible in standard digital CMOS at below 30mW power dissipation.

## 2.1 Cell-Network-Based Segmentation Architecture

The proposed architecture of a cell-network-based segmentation algorithm [5, 6] has the following features: (1) region-growing approach, (2) pixel-based fully-parallel processing, and (3) gray-scale or color-image segmentation by only changing an initialization step for weight calculation.

Figure 2: A block diagram of image segmentation and feature extraction for real-time applications.

Figure 3 shows the outline of the flowchart of the proposed segmentation algorithm. In an initialization phase (a), connection-weights are calculated from the luminance differences (RGB-data differences for color images) between neighboring pixels. Then leader cells, which are the seeds of the subsequent region-growing processes, are determined from the calculated connection-weights. In the main phase, self-excitation (b, c) and segment-growing (d, e) are executed to determine the individual segments. During the search phase for self-excitable cells (b), one of the leader cells is selected with a token-passing search. Then the selected leader cell is self-excited (c). In the subsequent region-growing process, excitable cells are determined with a threshold condition for the sum of connection-weights with excited neighbors (d). These excitable cells are then automatically excited (e) in parallel, which leads to a growth of this region. The growing operation is repeated as long as excitable cells exist in the loop (d)-(e). If no excitable cells are left, the region of excited cells is labeled as one segment and inhibited (f). The above self-excitation and region-growing phases are repeated until all leader cells are inhibited. Segmentation examples with the described algorithm are shown in Fig. 4.

Figure 4: Examples of real image segmentation results for gray-scale (a)-(c) and color (d) images.
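The flow (a)-(f) described above can be emulated sequentially in software. The following Python sketch is an illustration under stated assumptions, not the hardware algorithm itself: the weight formula `w_max - |I_a - I_b|`, the 4-pixel neighborhood, the leader criterion (sum of all neighbor weights above `leader_thresh`) and the raster-order scan standing in for the token-passing search are all choices made here for illustration, and all function and parameter names are hypothetical.

```python
def neighbors(r, c, h, w):
    """4-connected neighbors of (r, c) inside an h x w grid (assumed neighborhood)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < h and 0 <= cc < w:
            yield rr, cc


def weight(img, a, b, w_max=255):
    """Assumed connection-weight: large when two pixels have similar luminance."""
    return w_max - abs(img[a[0]][a[1]] - img[b[0]][b[1]])


def segment(img, leader_thresh, grow_thresh):
    """Return a label map (0 = unlabeled) for a 2-D list of gray levels."""
    h, w = len(img), len(img[0])
    cells = [(r, c) for r in range(h) for c in range(w)]
    # (a) Initialization: leader cells are seeds with a high summed weight.
    leaders = [p for p in cells
               if sum(weight(img, p, q) for q in neighbors(*p, h, w)) > leader_thresh]
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for seed in leaders:                       # (b) raster-order leader search
        if labels[seed[0]][seed[1]]:
            continue                           # this leader is already inhibited
        excited = {seed}                       # (c) self-excitation
        while True:
            # (d) excitable cells: unlabeled, not yet excited, and the weight
            #     sum to their excited neighbors exceeds the threshold
            excitable = {
                p for p in cells
                if p not in excited and not labels[p[0]][p[1]]
                and sum(weight(img, p, q) for q in neighbors(*p, h, w)
                        if q in excited) > grow_thresh
            }
            if not excitable:
                break
            excited |= excitable               # (e) excite all of them together
        next_label += 1
        for r, c in excited:                   # (f) label and inhibit the region
            labels[r][c] = next_label
    return labels
```

For an image made of two flat regions with gray levels 10 and 200, thresholds such as `leader_thresh=900` and `grow_thresh=128` separate the two regions, because the weight across the region border (65 under the assumed formula) stays below the growing threshold.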

The proposed cell-network-based VLSI implementation consists of 4 functional blocks (shown in non-bold form in Fig. 5), namely the connection-weight calculation circuit (A), the leader cell selection circuit (B), the image-segmentation cell-network (C) and the segmentation-restore circuit (D). Circuits (A) and (B) execute the initialization phase (a). The calculated connection-weights and leader cells are transmitted to the cell-network (C) in column-pipeline mode. In the cell-network (see Fig. 6), which consists of cells P<sub>ij</sub> and connection-weight-register blocks WR<sub>ij</sub>, steps (b)-(f) of the algorithm are carried out with pixel-parallel processing. Each active cell P<sub>ij</sub> corresponds to a pixel and is a processing element that determines its current status in each clock cycle from the connection-weights to its excited nearest neighbors. The connection-weights are stored in the connection-weight-register blocks WR<sub>ij</sub>, which are placed between the cells. Finally, the segmentation results are restored by the segmentation-restore circuit (D).

Figure 5: Block diagram of the cell-network-based image segmentation architecture with subdivided-image approach (SIA). The blocks with bold-line are added for SIA implementation.

Figure 6: Construction of the cell-network for 4×4 pixels. Vertical and horizontal registers are used for resource-sharing of the connection-weights.

Figure 7: Weight-parallel (a) and weight-serial (b) structure of active cells.

The active cell $P_{ij}$ consists of a decoder, an adder/subtractor, a control unit and four 1-bit registers. We have 2 types of $P_{ij}$ implementations, weight-parallel (WP, Fig. 7a, single clock-cycle solution) and weight-serial (WS, Fig. 7b, multiple clock-cycle solution), for high speed and high density, respectively. The main difference between these two implementations is the processing block for weight calculation, surrounded by the dashed lines in Fig. 7a, b.
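The trade-off between the two cell types can be illustrated with a small behavioral model. This is only a sketch: the text states that WP needs a single clock cycle and WS multiple clock cycles, while the assumption that WS handles exactly one neighbor weight per cycle, as well as the function names, are hypothetical.

```python
def weight_parallel(weights, excited):
    """Add the weights of all excited neighbors at once: (sum, clock cycles)."""
    total = sum(w for w, x in zip(weights, excited) if x)
    return total, 1


def weight_serial(weights, excited):
    """Accumulate one neighbor weight per cycle (assumed): (sum, clock cycles)."""
    total, cycles = 0, 0
    for w, x in zip(weights, excited):
        if x:
            total += w      # a single shared adder is reused every cycle
        cycles += 1
    return total, cycles
```

Both variants produce the same weight sum for the threshold comparison; the model differs only in the number of cycles, which reflects the speed versus area choice described above.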

The test-chip reported in [5] has a pixel integration density (PID) of 19.6pixel/mm<sup>2</sup>. Based on this test-chip design, we have estimated the possible PID for full-custom weight-parallel and weight-serial cell-network architectures in scaled-down CMOS technologies with 5 metal routing layers. Fig. 8 shows the estimation result at 90nm with a PID of 665pixel/mm<sup>2</sup> and a required chip-size of 116mm<sup>2</sup> (11mm x 11mm) for the QVGA image-size. If more than 5 metal layers are available, even higher pixel densities are possible by introducing dedicated routing layers for VDD and VSS. Since the typical chip-size for the cost-performance market in 2004 is expected to be 195mm<sup>2</sup> (14mm x 14mm) according to the International Technology Roadmap for Semiconductors (ITRS2002 Update), single-chip implementation of our weight-parallel architecture for QVGA-size image segmentation is predicted to be possible at the 90nm technology node. We have also estimated the processing time with a software simulator for the weight-parallel architecture and tested several natural image samples. The simulated average processing time of the weight-parallel architecture, implemented in the fabricated test-chip, is shown in Fig. 9. Consequently, a real-time, full-color, QVGA-size image segmentation chip in a 90nm CMOS technology with 5 metal layers should complete the frame segmentation in <250usec, even at a low clock frequency of 10MHz.

Figure 8: Chip-size estimation for weight-parallel (WP) and weight-serial (WS) architectures at the 90nm technology node with 5 metal layers as a function of the image size.

Figure 9: Image segmentation time estimation of WP architecture for larger image sizes at 10MHz clock frequency.
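The chip-size figure for QVGA can be checked with a short calculation, assuming the standard QVGA resolution of 320×240 pixels and the estimated PID of 665pixel/mm<sup>2</sup> at 90nm:

$$
N_{\mathrm{QVGA}} = 320 \times 240 = 76\,800\ \text{pixels}, \qquad
A = \frac{N_{\mathrm{QVGA}}}{\mathrm{PID}} = \frac{76\,800}{665\ \text{pixel/mm}^2} \approx 115.5\ \text{mm}^2 \approx 116\ \text{mm}^2 < 195\ \text{mm}^2
$$

The result agrees with the quoted 116mm<sup>2</sup> and stays below the ITRS chip size used for comparison.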

## 2.2 Subdivided-Image Approach (SIA)

The fast segmentation speed of the original architecture, explained in Section 2.1, can be exploited for reduced hardware cost by pipeline-processing of tiled images. The basic concept of this method, which we call *subdivided-image approach (SIA)*, is explained in Fig. 10. Input images are divided into tiles with an overlapped region of 1 row and 1 column. Then, each image tile is processed in sequential order in a correspondingly smaller cell-network. The segment label numbers of the pixels in the overlap region are reused as *prelabels* for the segmentation of subsequent tiles. As illustrated in Fig. 11, segments extending over several tiles are thus identified without problems. Consequently, SIA avoids the need for a large-scale cell-network, whose size would grow with the input image size, and enables compact integration.

Figure 10: Conceptual diagram of the subdivided-image approach (SIA). Pipelined segmentation of image tiles with a corresponding smaller-size cell-network is applied.

Figure 11: Processing example of the SIA approach. Prelabeled regions at the boundary of the tile enable correct segmentation of regions extending over several tiles.

The SIA architecture is realized as an extension of our original cell-network-based image segmentation architecture. Additional elements, shown as bold blocks in Fig. 5, are the SIA controller (E) and the label controller (F). Slight modifications of the image segmentation cell-network and the segmentation restore circuit are also required. The function of the proposed SIA architecture is briefly explained as follows. When the segmentation of the previous tile has finished, the SIA controller assigns the next image tile address *raddr* to the input memory. Pixel data are streamed in column-wise, and connection-weights $W_{ij}$ as well as leader cells $pp_i$ are calculated from the luminance (or R, G, B) data $I_i$ and written into the cell-network in column-parallel pipeline mode. The prelabeled cells in the overlap region (left column and upper row) are forced to become leader cells by the SIA controller. Then tile segmentation is carried out in the cell-network. After completion of the segmentation process, the label numbers of the overlap row and column shared with subsequent tiles are stored in the label controller and are transferred to the cell-network for segmentation of the respective tiles. Tile segmentation finishes by storing the segmentation results in the image-segmentation memory with the segmentation restore circuit.
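The tile geometry and the reuse of prelabels can be sketched as follows. This is a minimal illustration, assuming raster-order tile processing (the text only says "sequential order") and a full-image label map in which unprocessed pixels carry label 0; the function names are hypothetical.

```python
def sia_windows(width, height, tile_w, tile_h):
    """Yield half-open windows (x0, y0, x1, y1) in raster order.

    Each window is extended by one column to the left and one row upward
    (where available), which is the overlap shared with earlier tiles.
    """
    for ty in range(0, height, tile_h):
        for tx in range(0, width, tile_w):
            yield (max(tx - 1, 0), max(ty - 1, 0),
                   min(tx + tile_w, width), min(ty + tile_h, height))


def prelabels(label_map, window):
    """Copy labels already assigned in the window's first row and first column."""
    x0, y0, x1, y1 = window
    pre = [[0] * (x1 - x0) for _ in range(y1 - y0)]
    for x in range(x0, x1):
        pre[0][x - x0] = label_map[y0][x]      # upper overlap row
    for y in range(y0, y1):
        pre[y - y0][0] = label_map[y][x0]      # left overlap column
    return pre
```

With a 640×480 image and 40×32-pixel tiles, `sia_windows` yields 240 windows of at most 41×33 pixels. Cells with a nonzero prelabel would then be forced to act as leader cells for the tile, as described above.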

## 2.3 Low-Power Techniques

The guideline for low-power consumption is to keep all network cells, which are not involved in the growing process of the current segment, in a low-power stand-by mode.

### 2.3.1 Boundary-Active-Only (BAO) Concept

Since the architecture described up to now relies on pixel-based fully-parallel processing, power dissipation increases in proportion to the number of pixels. Without the SIA architecture, the power dissipation of a VGA-size cell-network would be about 1W. For battery-based mobile applications, a substantial reduction of the power dissipation is therefore indispensable.

Figure 13: Block diagram of the cell with BAO controller.

Figure 14: Circuit diagram of BAO controller in each cell for cell-internal power reduction.

For this purpose, we propose a boundary-active-only (BAO) scheme as a low-power technique which does not sacrifice real-time processing. Instead, it effectively exploits the region-growing characteristics of the algorithm, namely that only the boundary cells of the currently grown region have to be in active mode. All other cells may be set to a low-power stand-by mode, as conceptually illustrated in Fig. 12. More precisely formulated, a network cell is kept in stand-by mode if it satisfies one of the 3 following conditions: (1) It is already excited $(x_{ij}=1)$. (2) It already has a segment number $(l_{ij}=1)$. (3) It is not excited and has no segment number, but no neighboring cells were excited during the previous clock cycle t.
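The three stand-by conditions can be written as a single predicate per cell. The sketch below is a behavioral illustration only, assuming a 4-pixel neighborhood and interpreting "excited during the previous clock cycle" as cells that newly joined the region in that cycle; the function and argument names are hypothetical.

```python
def standby_map(excited, labeled, newly_excited_prev):
    """Return a 2-D list that is True where a cell may stay in stand-by mode.

    excited / labeled correspond to x_ij and l_ij; newly_excited_prev marks
    cells that became excited in the previous clock cycle (assumed reading).
    """
    h, w = len(excited), len(excited[0])

    def neighbor_newly_excited(r, c):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and newly_excited_prev[rr][cc]:
                return True
        return False

    return [[excited[r][c]                          # condition (1)
             or labeled[r][c]                       # condition (2)
             or not neighbor_newly_excited(r, c)    # condition (3)
             for c in range(w)] for r in range(h)]
```

In a 5×5 network where only the center cell has just been excited, only its four direct neighbors are active in the next cycle and the remaining 21 cells stay in stand-by.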

### 2.3.2 Implementation of the BAO Concept

The BAO architecture is implemented with both local and global approaches for power minimization. The local approach is to implement a BAO controller in each cell (see Figs. 13, 14), which examines the 3 stand-by-mode conditions and keeps the cell in stand-by mode if at least one of these conditions is true. The cells which satisfy none of the 3 conditions can be activated by a gated-clock signal (cell CLK<sub>ij</sub>). Since the cell-network has long global clock-lines with large capacitances, global clock control is applied as a global approach for efficient power reduction. Figure 15 explains the global BAO implementation, which restricts clock distribution to the network cells that are potentially active in the next clock cycle by boundary detection of the grown region. Detection of the region-growing boundary adds no overhead, because it is required anyhow for recognizing the end of region growing. The detection is carried out with an OR-function of the state signals of all network cells, indicating whether the respective cells have been included in the currently grown region during the previous clock cycle. Only cell-network rows i where cells have been included in the grown region will output a " $ZOR_i = 1$ " signal. The clock controller distributes the clock signal only to the neighboring rows {i-1, i, i+1} of the detected boundary cells in the next clock cycle.



Figure 15: Block diagram of global BAO implementation for power-reduction of clock distribution. Rows i containing region-boundary cells are detected ( $ZOR_i=1$ ) from the state signals of the row cells. The clock controller distributes the clock only to rows containing boundary cells and their nearest neighbor rows {i-1, i, i+1}.
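The row-wise clock gating described for the global BAO implementation can be modeled in a few lines. This is a behavioral sketch with hypothetical names, not the clock-controller circuit: `newly_excited` is assumed to hold, per cell, whether it joined the grown region in the previous clock cycle.

```python
def clocked_rows(newly_excited):
    """Return the sorted row indices that receive the clock in the next cycle."""
    n_rows = len(newly_excited)
    # ZOR_i: OR over the state signals of all cells in row i
    zor = [any(row) for row in newly_excited]
    rows = set()
    for i, z in enumerate(zor):
        if z:
            # clock the detected row and its nearest neighbor rows {i-1, i, i+1}
            rows.update(r for r in (i - 1, i, i + 1) if 0 <= r < n_rows)
    return sorted(rows)
```

If, for example, only row 2 of a 6-row network contains newly excited cells, the clock is distributed to rows 1, 2 and 3 and withheld from the other three rows.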

We have estimated the power-saving potential of the proposed low-power BAO architecture by worst-case analog circuit simulation with HSPICE. Even for the small cell-network size of $10 \times 10$ cells, more than 75% power reduction was achieved.

## 2.4 CMOS Test-Chip Design

Segmentation speed and power dissipation of the pipelined SIA segmentation architecture depend on the number of image tiles. Therefore, we estimated the most suitable tile size for VGA images before starting the test-chip design. Figure 16 shows the estimated power dissipation and image-segmentation time of the SIA architecture for 0.35um CMOS technology at 10MHz clock frequency as a function of the tile number. The image-segmentation time has to be traded off against the number of image tiles and the power dissipation, so that an optimum cell-network size can be chosen for each target application. For mobile applications with VGA-size video pictures (640×480 pixels), we have chosen the following boundary conditions: processing time <8msec (giving some margin with respect to real-time processing) and power dissipation <50mW. The tile size satisfying these conditions is $40 \times 32$ pixels. Including the overlap regions, a segmentation network with 41×33 cells becomes necessary. The detailed parameter choice for VGA-size image segmentation with the SIA architecture is summarized in Fig. 17.
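The numbers behind this tile-size choice can be reproduced with simple arithmetic. The sketch below assumes that the frame budget is shared evenly by all tiles, which is a simplification made here; the function name is hypothetical.

```python
def tile_budget(width, height, tile_w, tile_h, frame_budget_s):
    """Return (number of tiles, time budget per tile in s, network size in cells)."""
    tiles_x = -(-width // tile_w)     # ceiling division
    tiles_y = -(-height // tile_h)
    n_tiles = tiles_x * tiles_y
    # one extra column and row for the overlap with already processed tiles
    network = (tile_w + 1, tile_h + 1)
    return n_tiles, frame_budget_s / n_tiles, network
```

For a VGA frame, `tile_budget(640, 480, 40, 32, 8e-3)` gives 16 × 15 = 240 tiles, about 33.3usec per tile under the 8msec condition, and a 41×33-cell network.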



Figure 16: Estimated SIA performance data for VGA size images as a function of tile size (0.35um CMOS with 3-metal layer, 10MHz clock frequency).



Figure 17: Segmentation of a VGA-size image with subdivided-image pipeline processing.  $41 \times 33$ -pixel sized tiles are processed sequentially by the cell-network with BAO scheme.



Figure 18: Die photo of the network with BAO including  $41 \times 33$  cells. It is designed in a 0.35um 3-metal CMOS technology.

Table I: Characteristic data of the designed test-chip.

| Parameter | Value |
|---|---|
| Technology | 0.35µm, 2-Poly 3-Metal CMOS |
| Cell Architecture | Weight-Parallel (high-speed) [5] |
| Design Area | 6.9mm×7.4mm (41×33 cells) |
| Supply Voltage | 3.3V |
| Max Clock Frequency | 20MHz |
| Segmentation Time (41×33 pixels) | 23µsec@10MHz (worst case) |
| Power Dissipation (simulated, 41×33 pixels) | 21.8mW@10MHz (segmentation), 60.72mW@10MHz (initialization) |
| Pixel Density | 26.5pixel/mm<sup>2</sup> |

For verification purposes, particularly of the proposed low-power BAO concept, we have designed a test-chip of the main functional unit of the SIA architecture, the image segmentation cell-network with 41×33 cells, in 0.35um CMOS technology with 3 metal layers. The die photo of the fabricated test-chip is shown in Fig. 18. Cells and connection-weight-register blocks, enlarged on the right side of Fig. 18, are full-custom designed, resulting in over 50% area reduction compared to a standard-cell-based implementation. The power dissipation of the designed test-chip with implemented BAO concept, simulated with the HSIM circuit simulator [7], amounts to 21.8mW and 60.7mW in the segmentation and initialization phases, respectively. The segmentation-phase power is even lower than the power dissipation of a previously designed 12 times smaller cell-network with 10×10 cells [5], which does not include the BAO concept and consumes 24.4mW at 10MHz clock frequency. The characteristic data of the designed image segmentation test-chip are summarized in Table I.
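The design area and pixel density listed in Table I are consistent with each other, as a short calculation shows:

$$
A = 6.9\ \text{mm} \times 7.4\ \text{mm} = 51.06\ \text{mm}^2, \qquad
N = 41 \times 33 = 1353\ \text{cells}, \qquad
\frac{N}{A} = \frac{1353}{51.06\ \text{mm}^2} \approx 26.5\ \text{pixel/mm}^2
$$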

# 3. Conclusions

In this report, we have proposed a cell-network-based digital image segmentation architecture with pixel-parallel processing for gray-scale/color images in real-time applications. A CMOS test-chip for the cell-network, which is the main functional stage, has been fabricated in a 0.35um CMOS technology and verifies the effectiveness of our proposal. In the performance verification of the test-chip, high-speed segmentation in <9.5usec and low power dissipation of <36.4mW@10MHz were measured. The extrapolation results to larger image sizes suggest that QVGA-size image segmentation will be possible within 300usec@10MHz at the 90nm CMOS technology node. Furthermore, we have proposed a low-power and hardware-efficient pipelined segmentation architecture for VGA-size motion pictures, which applies a subdivided-image approach (SIA) for compact implementation and a boundary-active-only (BAO) scheme for low power dissipation. We have verified the effectiveness of the proposed architecture with a 51mm<sup>2</sup> test-circuit in 0.35um CMOS technology for the segmentation-network core consisting of 41×33 cells. The segmentation performance for a VGA-size input image is 21.8mW power dissipation and 7.49msec segmentation time at 10MHz clock frequency.

# 4. Future Plan

Future work includes the test-chip design of a large-size cell-network that also implements peripheral circuits such as the circuits for connection-weight calculation and leader cell selection. The improvement of the architecture for low power dissipation and the development of a complete image segmentation system are further important topics. Currently, we are planning a moving-object tracking architecture based on the proposed cell-network-based image segmentation architecture and a fully-parallel, area-efficient minimum-Manhattan-distance search associative memory. The development of the prototype architecture of the moving-object tracking system is an immediate research topic. Moreover, architecture/circuit development for the feature-extraction unit, which also requires the selection of concrete application examples, is the next step in our research effort towards the complete associative memory-based information processing system.

#### Acknowledgments

The test-chips in this study have been fabricated in the chip fabrication program of the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Rohm Corporation and Toppan Printing Corporation. HSIM is licensed within the academic collaboration program of Nassda Corporation.

## References

- [1] J. C. Russ, "The Image Processing Handbook," pp. 371-429, CRC PRESS, 1999.
- [2] B. Jähne, "Digital Image Processing, 5th revised and extended edition," ch. 16, pp. 427-440, Springer, 2002.
- [3] S. Y. Chien, et al., "Single chip video segmentation system with a programmable PE array," Proc. of 2002 IEEE Asia-Pacific Conf. on ASICs, pp. 233-236, 2002.
- [4] H. Ando, et al., "An image region extraction LSI based on a merged/mixed-signal nonlinear oscillator network circuit," Proc. of the 28th
- [5] T. Morimoto, et al., "Low-complexity, highly-parallel color motion-picture segmentation architecture for compact digital CMOS implementation," Extend. Abst. of the 2002 Int. Conf. on Solid State Devices and Materials, pp. 242-243, 2002.
- [6] T. Morimoto, et al., "Efficient video-picture segmentation algorithm for cell-network-based digital CMOS implementation," IEICE Trans. on Information & Systems, Vol.E87-D, No.2, pp. 500-503, 2004.
- [7] Nassda Corporation, "HSIM, Ver.3.0", http://www.nasda.com/, 2004.

# 5. Published Papers and Patents

## (1) Published Papers

T. Morimoto, Y. Harada, T. Koide, and H. J. Mattausch, "Efficient video-picture segmentation algorithm for cell-network-based digital CMOS implementation," IEICE Trans. on Info. & Sys., Vol.E87-D (2) (2004) pp. 500-503.

## (2) Proceedings

1. T. Koide, T. Morimoto, Y. Harada, and H. J. Mattausch, "Digital gray-scale/color image-segmentation architecture for cell-network-based real-time applications," Proc. of The 2002 Int'l Tech. Conf. on Cir. & Sys., Computers and Communications (ITC-CSCC2002), pp. 670-673, 2002.
2. T. Morimoto, Y. Harada, T. Koide, and H. J. Mattausch, "Real-time segmentation architecture of gray-scale/color motion pictures and digital test-chip implementation," Proc. of The 2002 IEEE Asia-Pacific Conf. on ASICs (AP-ASIC2002), pp. 237-240, 2002.
3. T. Morimoto, Y. Harada, T. Koide, and H. J. Mattausch, "Low-complexity, highly-parallel color motion-picture segmentation architecture for compact digital CMOS implementation," Ext. Abs. of the 2002 Int'l Conf. on Solid State Devices and Materials (SSDM2002), pp. 242-243, 2002.
4. Y. Harada, T. Morimoto, T. Koide, and H. J. Mattausch, "CMOS test chip for a high-speed digital image-segmentation architecture with pixel-parallel processing," Proc. of the 2003 Int'l Tech. Conf. on Cir. & Sys., Computers and Communications (ITC-CSCC 2003), pp. 284-287, 2003.
5. T. Morimoto, Y. Harada, T. Koide, and H. J. Mattausch, "Low-power real-time region-growing image-segmentation in 0.35um CMOS due to subdivided-image and boundary-active-only architectures," Ext. Abs. of the 2003 Int'l Conf. on Solid State Devices and Materials (SSDM2003), pp. 146-147, 2003.
6. T. Morimoto, Y. Harada, T. Koide, and H. J. Mattausch, "350nm CMOS test-chip for architecture verification of real-time QVGA color-video segmentation at the 90nm technology node," Proc. of the Asia South Pacific Design Automation Conf. 2004 (ASP-DAC2004), pp. 531-532, 2004.
7. O. Kiriyama, T. Morimoto, H. Adachi, Y. Harada, T. Koide, and H. J. Mattausch, "Low-power design for real-time image segmentation LSI and compact digital CMOS implementation," Proc. of The 2004 IEEE Asia-Pacific Conf. on ASICs (AP-ASIC2004), 2004, to appear.

## (3) Patents

1. "Image segmentation method, image segmentation apparatus, image processing method, and image processing apparatus", JPN Patent Application No. 2002-152491, 2002.05.27.
2. "Image segmentation method, image segmentation apparatus, image processing method, and image processing apparatus", USA Patent Application No.10/445,247, 2003.05.26.
3. "Image segmentation method, image segmentation apparatus, image processing method, and image processing apparatus", EPC Patent Application No.03011840.0, 2003.05.26.
4. "Image segmentation method, image segmentation apparatus, image processing method, and image processing apparatus", KOR Patent Application No.2003-33324, 2003.05.26.
5. "Image segmentation method, image segmentation apparatus, image processing method, and image processing apparatus", TWN Patent Application No.92114142, 2003.05.26.
6. "Image segmentation apparatus, image segmentation method, and image segmentation integrated circuit", JPN Patent Application No. 2003-322163, 2003.09.12.
7. "Image segmentation apparatus, image segmentation method, and image segmentation integrated circuit", USA, EPC, KOR, TWN Patent Application No. TBD, 2004.05.31.

## (4) Awards

T. Morimoto, Y. Harada, T. Koide, and H. J. Mattausch, "A real-time picture-segmentation architecture for intelligent information processing," The 4<sup>th</sup> LSI IP Design Award, Development Encouragement Award, LSI IP Design Award Committee, 2002.5. URL http://ne.nikkeibp.co.jp/award/

## (5) Others

1. T. Morimoto, Y. Harada, T. Koide, and H. J. Mattausch, "Gray-scale/color image-segmentation architecture based on cell-network," Technical Report of IEICE, VLD2002-48, pp. 49-54, 2002, in Japanese.
2. Y. Harada, T. Koide, and H. J. Mattausch, "LSI chip design for fully pixel parallel image segmentation cell-network," the 53<sup>rd</sup> 2002 Annual Technical Conference of the Chugoku Chapter of the Electronics and Information Institutes, pp. 591-592, 2002, in Japanese.
3. T. Morimoto, Y. Harada, T. Koide, and H. J. Mattausch, "A cell-network-based image segmentation LSI for real-time applications," The 5<sup>th</sup> IEEE Hiroshima Student Symposium (HISS), pp. 221-224, 2002, in Japanese.
4. O. Kiriyama, T. Morimoto, H. Adachi, Y. Harada, T. Koide, and H. J. Mattausch, "Low power design for cell-network based image segmentation LSI," IEICE General Conference, No. C-12-10, p.112, 2004, in Japanese.