Owing to its close relationship with video analysis and image understanding, object detection has gained a great deal of research momentum over the years. Detecting human-object interactions is a fundamental problem in computer vision because it provides semantic information about the interactions between detected objects. To process the extensive amount of information in video data, a deep learning framework based on YOLOv3 and YOLOv4 is used for the problem of human-object interaction detection. Real-life scenarios containing human activities (using a cell phone) recorded via camera can be addressed through static images, videos, a real-time webcam, and real-time CCTV surveillance; all of these input modes are covered in this paper, along with a count of mobile phone users. Recognizing mobile phone use in prohibited areas is addressed by detecting the objects predicted by the bounding boxes. Two public datasets, HICO-DET and MS-COCO, are used for training and evaluation of the model. Experimental results and analysis compare the YOLOv3 and YOLOv4 algorithms before and after applying 4-fold cross-validation. Because YOLOv4 is an improvement on the YOLOv3 algorithm, it requires a high-end machine with a GPU, whereas YOLOv3 is compatible with commonly available machines; we therefore compare our results on both a commonly available machine and a dedicated GPU machine. The results indicate that the performance of YOLOv4 is far better than that of YOLOv3. The limitations of the existing framework and some improvements on it are also suggested in this research paper.
Index Terms- Action recognition, Computer vision, Deep learning, Human interaction, Human action recognition, Object detection.
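The count of mobile phone users described above reduces to filtering a detector's predicted bounding boxes by class label and confidence. The following is a minimal sketch of that counting step, assuming detections arrive as `(class_name, confidence, (x, y, w, h))` tuples in the style of a YOLO detector's output; the class name, threshold value, and function name are illustrative, not taken from the paper.

```python
# Hypothetical sketch: counting cell-phone detections from YOLO-style output.
# The detection format, class label, and threshold below are assumptions.

CONF_THRESHOLD = 0.5  # assumed confidence cutoff, not specified in the paper

def count_phone_users(detections):
    """Count bounding boxes labelled 'cell phone' above the confidence cutoff."""
    return sum(
        1
        for cls, conf, box in detections
        if cls == "cell phone" and conf >= CONF_THRESHOLD
    )

# Example frame: one person, two confident phone detections, one weak one.
frame_detections = [
    ("person", 0.98, (10, 20, 120, 300)),
    ("cell phone", 0.91, (40, 60, 30, 50)),
    ("cell phone", 0.35, (200, 80, 25, 45)),  # below threshold, ignored
    ("cell phone", 0.77, (220, 90, 28, 48)),
]
print(count_phone_users(frame_detections))  # → 2
```

In a video or CCTV setting, this count would be computed per frame from the detector's output; associating detections across frames (to avoid double-counting the same user) would require an additional tracking step that this sketch does not attempt.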