Modern interconnects often have programmable processors in the network interface that can be used to offload communication processing from the host CPU. In this paper, we explore different schemes to support collective operations at the network interface and propose a new collective protocol. With barrier as an initial case study, we demonstrate that this collective protocol greatly simplifies much of the communication processing. Accordingly, we have designed and implemented efficient and scalable NIC-based barrier operations over two high-performance interconnects, Quadrics and Myrinet. Our evaluation shows that, over an 8-node Quadrics cluster with the Elan3 network, the NIC-based barrier operation achieves a barrier latency of only 5.60$\mu$s, an improvement by a factor of 2.48 over the Elanlib tree-based barrier operation. Over an 8-node Myrinet cluster with LANai-XP NICs, it achieves a barrier latency of 14.20$\mu$s, an improvement by a factor of 2.64 over the host-based barrier algorithm. Furthermore, an analytical model developed for the proposed scheme indicates that a NIC-based barrier operation on a 1024-node cluster can be performed with only 22.13$\mu$s latency over Quadrics and 38.94$\mu$s latency over Myrinet. These results indicate the potential for developing high-performance communication subsystems for next-generation clusters.