Adaptive beamforming has become an essential feature of smart antenna arrays for improving spectrum efficiency. However, modern embedded wireless communication systems impose challenging constraints on adaptive algorithms that target a parallel, pipelined implementation on resource-limited devices such as field-programmable gate arrays (FPGAs). These constraints include reduced complexity, parallelism, accelerated convergence, and low residual error. Several variants of classical adaptive beamformers have been proposed to accelerate convergence while maintaining a low error floor; other proposals have focused on a parallel, pipelined architecture. The resulting beamforming algorithms either improved the convergence profile at the cost of increased complexity, or offered a pipelined hardware architecture without any significant improvement. To provide a unified solution that combines a superior convergence profile with a low-complexity, parallel, pipelined architecture, we propose a two-stage algorithm called the parallel least mean square structure (pLMS). pLMS is further simplified to obtain the reduced-complexity pLMS design (RC-pLMS). To obtain a pipelined hardware architecture, we apply the delay and sum relaxation technique, yielding DRC-pLMS. We also study the behavior and performance of different hardware design tools and processor architectures. Computer simulations demonstrate the outstanding performance of RC-pLMS. DRC-pLMS can operate at a maximum frequency of 208.33 MHz with a minor increase in resource usage compared to LMS.
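
For orientation, the classical LMS beamformer used as the comparison baseline can be sketched as follows. This is a minimal illustration of the textbook complex LMS update only; it is not the pLMS, RC-pLMS, or DRC-pLMS structure, whose details are not given here. The array geometry (8-element uniform linear array with half-wavelength spacing), the arrival angles, the BPSK signals, the noise level, the step size `mu`, and all function names are assumptions chosen for the example.

```python
import cmath
import math
import random


def steering_vector(num_elements, angle_deg, spacing=0.5):
    """Steering vector of a uniform linear array (spacing in wavelengths)."""
    phase = 2 * math.pi * spacing * math.sin(math.radians(angle_deg))
    return [cmath.exp(-1j * phase * k) for k in range(num_elements)]


def lms_beamformer(snapshots, reference, mu):
    """Textbook complex LMS: e = d - w^H x, then w <- w + mu * x * conj(e)."""
    w = [0j] * len(snapshots[0])
    squared_errors = []
    for x, d in zip(snapshots, reference):
        # Beamformer output y = w^H x
        y = sum(wk.conjugate() * xk for wk, xk in zip(w, x))
        e = d - y
        # Stochastic-gradient weight update
        w = [wk + mu * xk * e.conjugate() for wk, xk in zip(w, x)]
        squared_errors.append(abs(e) ** 2)
    return w, squared_errors


def simulate(num_elements=8, num_samples=2000, mu=0.01, seed=0):
    """Hypothetical scenario: one desired BPSK source and one BPSK interferer."""
    rng = random.Random(seed)
    a_sig = steering_vector(num_elements, 20.0)   # assumed desired direction
    a_int = steering_vector(num_elements, -40.0)  # assumed interferer direction
    snapshots, reference = [], []
    for _ in range(num_samples):
        s = complex(rng.choice((-1, 1)), 0)
        i = complex(rng.choice((-1, 1)), 0)
        x = [
            s * a + i * b + complex(rng.gauss(0, 0.05), rng.gauss(0, 0.05))
            for a, b in zip(a_sig, a_int)
        ]
        snapshots.append(x)
        reference.append(s)  # the reference signal is the desired symbol
    return lms_beamformer(snapshots, reference, mu)


if __name__ == "__main__":
    weights, sq_err = simulate()
    print("mean squared error, first 20 samples:", sum(sq_err[:20]) / 20)
    print("mean squared error, last 200 samples:", sum(sq_err[-200:]) / 200)
```

The step size trades convergence speed against the residual error floor, which is the tension the constraints listed above refer to. Note also that each weight update depends on the error computed from the current weights; this feedback loop is what makes a straightforward LMS implementation hard to pipeline in hardware.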