GIFT: Gesture-Based Interaction by Fingers Tracking, an Interaction Technique for Virtual Environment

or manipulated. At the detection of the finger-tips, a Virtual Hand (VH), indicating the user's position in the VE, is activated. The VH moves freely along the x, y and z axes with the movements of the hand.

hand. Navigation and panning are performed by the feasible movements of the hand. As bare-handed gestures are critical [35] and the recognition accuracy of gestures can be improved with coloured-markers [35], therefore the two contrast coloured markers are used for the robust and reliable detection of the fingers. Instead of extracting features from an entire hand posture [36], interactions are performed by the positions of the two fingers, hence fewer computation is ensured.
Using the OpenCV and OpenGL libraries, the technique is implemented in a case-study project, IGF. The dynamic positions of the fingers in each scanned image frame are traced using an ordinary camera. From the relative 2D positions of the fingers, 3D interaction is performed inside the designed VE. In a normal lighting environment, the system was evaluated over a total of 570 interactions by 15 participants. The satisfactory accuracy rate (94.3%) confirms the suitability of the technique for VR applications.
The paper is organized into eight main sections. Section II reviews the related research. The technique, with its algorithm and mathematical details, is covered in Section III. Details of the supported interactions are given in Section IV. Implementation and evaluation details are explained in Section V. A comparative analysis is performed in Section VI, whereas applicability in games is investigated in Section VII. The last section concludes the paper and outlines future work.

II. Literature Review
With the privilege of interaction, a VR user maintains the belief of being there and accepts a VE as if it were real. To achieve a high degree of naturalness, the interface of a VR system should be simple and consistent [37]. The conventional ways of interacting with a keyboard and mouse are insufficient to properly engross users [11]. As human gestures are flexible and natural [38], gesture/posture-based interactions are well suited to VR applications [15]. Various gesture-based techniques have been proposed in the VR literature for intuitive interaction. The magnetic-tracker-based system of [39] provides 6-DOF selection, but DeSelection necessitates the use of both hands simultaneously. The haptic feedback interface for virtual object manipulation of [40] needs a cumbersome setup of wires, and the system is usable only inside a restricted workspace. The approach of assigning different gestures to different commands is followed in IVEAS (Immersive Virtual Environment Authoring System) designed by Lee et al. [41]. The technique covers the basic interactions; however, the use of a CyberGlove and Polhemus Fastrak for detecting the finger joints, and the required setup, add discomfort. The tangible 3D system designed by Kim et al. [42] for the manipulation of 3D objects supports only five static hand gestures. Reifinger et al. [43] presented an infrared-based tracking system in which infrared markers are used to track both static and dynamic hand gestures. The system works over a wide lighting range but needs an array of six IR (infrared) cameras besides a high-cost Head Mounted Display (HMD). Valuable research has also investigated the use of fiducial markers in VR and AR systems [44]. Hand gestures are traced by the FingARtips system of Buchmann et al. [45] for the manipulation of a virtual object, where unique fiducial markers are used to distinguish the different fingers. The use of the full-body tracking device Microsoft Kinect has also been investigated for 3D interactions [46,47]. Although the Kinect is applicable for tracking body gestures [48], it cannot differentiate individual fingers [49]. The technique of Jin et al. [50] detects finger movements with the help of the multi-sensor Leap Motion Controller (LMC). The system supports fast navigation with a satisfactory accuracy rate when used with an HMD. However, due to the limited workspace of the LMC [51], the accuracy of the system falls if gestures are posed outside a short range. Summarizing the challenges of gesture-based interactions, Benko [52] pointed out the key issue of false tracking [41]. This research work is an attempt to minimize false tracking by recognizing gestures of two fingers with the help of an ordinary camera.

III. GIFT: The Proposed Technique
This research work presents a gesture-based direct-interaction [11] technique for VEs. All interactions are performed on the basis of the dynamic positions of the two fingers.
Instead of following computationally costlier methods of gesture recognition, two finger-caps of green (for the index finger) and pink (for the thumb) are used. To avoid detecting a background object of the same colour, a Region Of Interest (ROI_Img) is first extracted from the dynamically scanned frame image (Fr_Img) on the basis of skin colour.

The system works in three phases: the Exploration Phase (EP), the Translation Phase (TP) and the Manipulation Phase (MP). Initially, the system starts in the EP and supports navigation and panning to explore the synthetic world. Translation and manipulation (scaling and rotation) are performed along an arbitrary axis in the TP and MP respectively. Switching between any two phases is activated by posing a distinct gesture. Three intuitive gestures of the two fingers are used to switch the system from one phase to another: open-pinch, closed-pinch and cross-sign. Open-pinch is the posture where the two fingers are gently apart, see Fig. 2. The open-pinch and closed-pinch gestures activate the MP and TP respectively, while the cross-sign switches the system back to the default EP.

The object to be manipulated is inscribed inside two concentric cubes, as shown in Fig. 3. The outer cube, inscribing the whole object, specifies the Outer Aura (OA); the inner cube, marking the central portion of the object, represents the Inner Aura (IA). To translate an object, IA selection of the object is performed using the closed-pinch gesture. Similarly, to activate manipulation of an object, the object is selected for the OA using the open-pinch gesture. To visualize the switching between phases, the appropriate cube around the object is highlighted. Navigation and panning are performed outside the SM zone (set from the initial finger positions; see Section IV): the Dynamic Areas (DAs) of the finger-caps are compared with the CAs to move into the VE (forward navigation) or out of it (backward navigation). Within the SM zone, translation and manipulation are performed according to the attained state of the system (TP or MP).
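For concreteness, the phase-switching logic described above can be expressed as a small state machine. The following is a minimal Python sketch; the names Phase, next_phase and the gesture labels are ours, not part of the IGF implementation.

```python
from enum import Enum

class Phase(Enum):
    EP = "exploration"   # default: navigation and panning
    TP = "translation"   # entered via closed-pinch on an object's Inner Aura
    MP = "manipulation"  # entered via open-pinch on an object's Outer Aura

def next_phase(current: Phase, gesture: str, vh_over_object: bool) -> Phase:
    """Switch phases according to the gestures described in Section III.
    `gesture` is one of: 'open_pinch', 'closed_pinch', 'cross_sign'."""
    if gesture == "cross_sign":
        return Phase.EP                      # cross-sign always returns to exploration
    if current is Phase.EP and vh_over_object:
        if gesture == "closed_pinch":
            return Phase.TP                  # IA selection -> Translation Phase
        if gesture == "open_pinch":
            return Phase.MP                  # OA selection -> Manipulation Phase
    return current                           # otherwise stay in the current phase
```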

A. The Architecture of the Proposed Technique
Besides rendering the virtual scene at the front-end, the system extracts the ROI_Img at the back-end. The ROI_Img is converted to HSV and is then thresholded for a broad range of green colour. At the detection of the thimble of the index finger, the ROI_Img is split horizontally at the centre of the index-finger cap to get the Down-Section Image (DS_Img), see Fig. 4. The DS_Img is thresholded for pink colour to find the thumb-tip in the image. When both fingers are traced, coordinate mapping between the fingers in the image frame and the VH in the VE is initiated. As long as both caps are visible, the VH moves freely with the movements of the fingers to access a 3D object. Navigation is performed by tracing a change in the areas of the finger-caps, while panning is performed on the basis of the dynamic positions of the fingers. If an object is selected with the open-pinch posture (OA selection), the MP is activated. Similarly, by bringing the VH over an object and posing a closed-pinch (IA selection), the object is selected for translation (TP). The user is informed about either state by highlighting the respective cube inscribing the object. A schematic of the entire system is shown in Fig. 5.
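The back-end detection pipeline can be approximated with OpenCV as follows. This is a minimal Python sketch: the HSV ranges, function name and variable names are our assumptions, not the exact values used in IGF.

```python
import cv2

def detect_finger_caps(roi_img):
    """Locate the green index cap, split the ROI below it, then locate the pink thumb cap."""
    hsv = cv2.cvtColor(roi_img, cv2.COLOR_BGR2HSV)

    # Broad green range for the index-finger cap (illustrative values).
    green_mask = cv2.inRange(hsv, (35, 60, 60), (85, 255, 255))
    m = cv2.moments(green_mask, binaryImage=True)
    if m["m00"] == 0:
        return None, None                     # index cap not visible
    index_c = (int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"]))

    # Down-Section image (DS_Img): everything below the centre of the index cap.
    ds_img = hsv[index_c[1]:, :]

    # Pink range for the thumb cap (illustrative values).
    pink_mask = cv2.inRange(ds_img, (145, 60, 60), (175, 255, 255))
    m2 = cv2.moments(pink_mask, binaryImage=True)
    if m2["m00"] == 0:
        return index_c, None                  # thumb cap not visible
    thumb_c = (int(m2["m10"] / m2["m00"]),
               int(m2["m01"] / m2["m00"]) + index_c[1])   # back to ROI coordinates
    return index_c, thumb_c
```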

B. Image Segmentation
Image segmentation refers to the extraction of a set of pixels [53]. At the outset, the ROI_Img is extracted to avoid the chance of false detection when similar colours are present in the background. The ROI_Img is the most probable section of the Fr_Img containing both of the fingers. As the YCbCr space best differentiates between skin and non-skin colours [54], the YCbCr model is used for skin-colour segmentation. After obtaining the binary image of the Fr_Img, see Fig. 6, the ROI_Img with m rows and n columns is extracted from the Fr_Img using our previously designed algorithm [55], where Lm, Rm and Dm represent the left-most, right-most and down-most skin pixels respectively. The segmented ROI_Img is then thresholded for green and pink colours in the HSV colour space.
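A minimal sketch of the skin-based ROI extraction is given below, assuming widely used YCbCr (YCrCb in OpenCV) skin bounds; the exact thresholds and the algorithm of [55] are not reproduced here.

```python
import cv2
import numpy as np

def extract_roi(fr_img):
    """Extract ROI_Img from Fr_Img: threshold skin in YCrCb, then crop around
    the left-most (Lm), right-most (Rm) and down-most (Dm) skin pixels."""
    ycrcb = cv2.cvtColor(fr_img, cv2.COLOR_BGR2YCrCb)
    # Commonly used Cr/Cb skin bounds; the paper's own thresholds may differ.
    skin = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))

    ys, xs = np.nonzero(skin)
    if xs.size == 0:
        return fr_img                          # no skin found: fall back to the full frame
    lm, rm, dm = xs.min(), xs.max(), ys.max()  # left-most, right-most, down-most skin pixels
    top = ys.min()
    return fr_img[top:dm + 1, lm:rm + 1]       # ROI_Img of m rows and n columns
```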

C. Coordinates Mapping
A challenging task in this research was to harmonize the pixels representing the finger-caps with the coordinates of the VH. To reconcile the dissimilar coordinate systems, we devised our own mapping function f [55]. The image frame is virtually split into four regions, R1 to R4, as shown in Fig. 8. Mapping is made by the corresponding function, taking the x and y of a pixel of Rn as independent variables. The OpenCV pointer variable CP is used to locate and move the virtual hand based on the positions of both fingers.

If CP_init represents the CP in the initial frames and CP_dyn represents the dynamic position of the CP in any scanned frame, then the virtual hand's position is calculated from the displacement of CP_dyn from CP_init, scaled by the speed constant k and normalised by Tc and Tr, the total numbers of columns and rows of the frame respectively. The greater the value of k, the speedier the movements of the VH; we kept a moderate value of k in the IGF project. Furthermore, to ignore unintentional movements of the fingers, the mapping is performed only when the displacement along either axis exceeds five pixels. The mapping process is shown in Fig. 9.
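One plausible realisation of the mapping is sketched below; the exact expression of the mapping function f in [55] is not reproduced here, so the normalisation and the default value of k are assumptions.

```python
def map_cp_to_vh(cp_init, cp_dyn, vh_pos, k=2.0, tc=640, tr=480):
    """Map the central point (CP) of the two finger-caps to the Virtual Hand.
    cp_init: CP in the initial frames, cp_dyn: CP in the current frame,
    vh_pos: current (x, y) of the VH; k, tc, tr as described in the text.
    The value of k and the mapping expression are illustrative assumptions."""
    dx = cp_dyn[0] - cp_init[0]
    dy = cp_dyn[1] - cp_init[1]
    # Ignore unintentional jitter of the fingers (<= 5 pixels).
    if abs(dx) <= 5 and abs(dy) <= 5:
        return vh_pos
    # Normalise by the frame size and scale by the speed constant k.
    return (vh_pos[0] + k * dx / tc,
            vh_pos[1] - k * dy / tr)   # image y grows downwards, VE y grows upwards
```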

IV. Interactions Support
All the basic 3D interactions are performed by intuitive gestures of the fingers. The 2D positions of the fingers traced in the initial frames are assumed to be at a moderate distance from the camera; hence the SM zone is set as soon as the fingers' tips are recognized by the system.

A. Navigation
Navigation is used to explore a VE. Mostly, a user needs to navigate to a proper location before performing selection or manipulation. In the proposed technique, forward or backward movement of the hand outside the SM zone performs navigation. As one would expect, forward navigation is carried out by a forward movement of the hand (see Fig. 10); similarly, backward navigation is performed by a backward movement of the hand outside the SM zone. To deduce forward or backward hand movements from a scanned 2D image, the DAs (Dynamic Areas) of the finger-caps are calculated and compared with the CAs at runtime. By crossing the forward limit of the SM zone along the z-axis, the DAs increase, see Fig. 11.

Forward navigation is therefore triggered when DA_x > y · CA_x, where x = {index, thumb} and y > 1. To avoid the possibility of an unintended increase/decrease in the DAs, the Fingers Average Dynamic (FAD) positions along the x-axis (FAD.x) and y-axis (FAD.y) are also checked against CP.x and CP.y, as in the pseudo-code sketch below.
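The navigation check can be sketched in Python as follows; the tolerance factor y, the jitter bound and the variable names are illustrative assumptions consistent with the description above.

```python
def detect_navigation(da, ca, fad, cp, y=1.2, jitter=5):
    """Decide forward/backward navigation outside the SM zone.
    da/ca: dicts of dynamic/initial cap areas for 'index' and 'thumb';
    fad: Fingers Average Dynamic position (x, y); cp: central position CP.
    y > 1 is a tolerance factor; both numeric values are illustrative."""
    # Reject sideways drift of the hand, which would change the areas unintentionally.
    if abs(fad[0] - cp[0]) > jitter or abs(fad[1] - cp[1]) > jitter:
        return "none"
    if all(da[f] > y * ca[f] for f in ("index", "thumb")):
        return "forward"    # caps appear larger: hand moved towards the camera
    if all(da[f] < ca[f] / y for f in ("index", "thumb")):
        return "backward"   # caps appear smaller: hand moved away from the camera
    return "none"
```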

B. Selection and DeSelection
Selection means choosing an object for interaction. Selection for scaling and rotation is made by bringing the VH over the object and posing an open-pinch gesture; with this, OA selection is performed and the outer cube inscribing the object is highlighted. Similarly, an object is selected for translation by posing a closed-pinch to perform IA selection. Posing a cross gesture with the two fingers deselects the object. The system recognizes this gesture when the fingers exceed a specified threshold along the x-axis from their horizontal central positions. If ID and TD represent the Index-Dynamic and Thumb-Dynamic positions, the DeSelection check proceeds as in the sketch below. Once an object is deselected, the system switches back to the default EP.
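A hedged Python sketch of the DeSelection check, assuming the cross-sign is detected when the index and thumb tips move past a horizontal threshold in opposite directions; the threshold value, the `selected` attribute and all names are illustrative.

```python
def is_cross_gesture(id_pos, td_pos, ic_x, tc_x, threshold=20):
    """ID/TD: Index-Dynamic and Thumb-Dynamic tip positions (x, y);
    ic_x/tc_x: the fingers' horizontal central (initial) x positions.
    `threshold` (pixels) is an illustrative value."""
    index_shift = id_pos[0] - ic_x
    thumb_shift = td_pos[0] - tc_x
    # Cross-sign: both tips exceed the threshold along x, in opposite directions.
    return (abs(index_shift) > threshold and abs(thumb_shift) > threshold
            and index_shift * thumb_shift < 0)

def deselect_if_cross(obj, id_pos, td_pos, ic_x, tc_x):
    """`obj` is any object with a boolean `selected` attribute (hypothetical)."""
    if obj.selected and is_cross_gesture(id_pos, td_pos, ic_x, tc_x):
        obj.selected = False      # object released
        return "EP"               # system switches back to the default Exploration Phase
    return None
```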

C. Translation
Translation means changing the position of a virtual object along the x, y and/or z-axis in a VE. Like gripping an object in the real world, the closed-pinch gesture is used to select an object for translation. To translate an object, the system needs to be in the TP. The transition from the EP to the TP is made by hovering the VH over an object and posing the closed-pinch gesture; besides switching from the EP to the TP, this performs IA selection of the object. After the selection, translation is performed in the VE by the movements of the hand.
A closed-pinch is recognized when the two finger-tips remain close together; unless the finger-tips are moved apart, the IA selection of the object remains intact. During translation, the dynamic coordinates of the VH are assigned to the selected object, so the object follows the free movement of the hand. After the IA selection of an object Obj, the 3D coordinates of Obj simply track those of the VH, as sketched below.
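A minimal sketch of the translation step, assuming the closed-pinch condition is a simple proximity test on the two finger-tips; the threshold value and the object attributes are illustrative.

```python
import math

def is_closed_pinch(index_tip, thumb_tip, max_gap=15):
    """Closed-pinch: the two finger-tips are (nearly) touching.
    `max_gap` (pixels) is an illustrative threshold."""
    return math.dist(index_tip, thumb_tip) <= max_gap

def translate_selected(obj, vh, index_tip, thumb_tip):
    """While the IA selection is intact, the object's 3D coordinates
    track the Virtual Hand: (Obj.x, Obj.y, Obj.z) = (VH.x, VH.y, VH.z)."""
    if is_closed_pinch(index_tip, thumb_tip):
        obj.x, obj.y, obj.z = vh.x, vh.y, vh.z
        return True
    return False    # fingers moved apart: IA selection ends
```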

D. Rotation
Within the SM zone, a selected object is rotated using the ChordBall technique [56]. As one would expect, moving the VH along the x-axis with the open-pinch rotates the object about the y-axis; similarly, movement of the VH along the y-axis rotates the object about the x-axis. The angle θ of rotation is calculated from the distance d of the fingers from the central position (CP): the greater the distance, the larger the angle of rotation.
The same scheme is followed to perform rotation about each axis (rotation about the y-axis and rotation about the z-axis); a sketch of the mapping is given below.
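A minimal sketch of the rotation mapping for the x- and y-axis behaviour described above; the gain g and the names are assumptions, and the actual ChordBall computation of [56] is not reproduced.

```python
def rotation_angles(fad, cp, g=0.5):
    """Map the fingers' displacement from the central position CP to rotation angles.
    Horizontal displacement rotates the object about the y-axis, vertical displacement
    about the x-axis; the angle grows with the distance from CP. The gain g is illustrative."""
    dx = fad[0] - cp[0]     # displacement along x -> rotation about the y-axis
    dy = fad[1] - cp[1]     # displacement along y -> rotation about the x-axis
    theta_y = g * dx        # degrees per frame about the y-axis
    theta_x = g * dy        # degrees per frame about the x-axis
    return theta_x, theta_y
```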

E. Scaling
After OA selection with the open-pinch, an object can be scaled up or down. With the activation of the MP, the initial distance between the two fingers is traced. In the MP, a selected object is scaled up or down by comparing the dynamic distance between the fingers with this initial distance: a gentle increase in the distance between the thumb and index finger scales the object up, and vice versa. If the gesture is parallel to the x-axis, scaling along the x-axis is performed; scaling along the y-axis is performed if the gesture is parallel to the y-axis, as shown in Fig. 12. Scaling up or down along the z-axis is enacted by increasing or decreasing the diagonal distance respectively, see Fig. 13.
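A minimal sketch of the scaling rule, under the assumption that the orientation of the pinch gesture decides which axis is scaled; the angle tolerance and names are illustrative.

```python
import math

def scaling_update(index_tip, thumb_tip, initial_dist, scale, angle_tol=20):
    """Scale up/down along the axis to which the pinch gesture is parallel.
    `scale` is the object's (sx, sy, sz); `angle_tol` (degrees) is illustrative."""
    dx = abs(index_tip[0] - thumb_tip[0])
    dy = abs(index_tip[1] - thumb_tip[1])
    dyn_dist = math.hypot(dx, dy)
    factor = dyn_dist / initial_dist          # >1 scales up, <1 scales down
    angle = math.degrees(math.atan2(dy, dx))  # orientation of the finger gesture
    sx, sy, sz = scale
    if angle < angle_tol:                     # gesture roughly parallel to the x-axis
        sx *= factor
    elif angle > 90 - angle_tol:              # gesture roughly parallel to the y-axis
        sy *= factor
    else:                                     # diagonal gesture: scale along the z-axis
        sz *= factor
    return (sx, sy, sz)
```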

V. Implementation and Evaluation of the Technique
To systematically evaluate the proposed approach, the technique is implemented in the case-study application IGF (Interaction by the Gestures of Fingers). The two open-source libraries OpenGL and OpenCV are used for front-end rendering and back-end image processing respectively. A Core i3 laptop with a 2.30 GHz processor and 4 GB RAM was used for the implementation and evaluation. The built-in camera of the laptop (resolution 640x480) was used to capture the image stream at runtime. Fifteen participants, all male, aged between 25 and 40 (mean=31, SD=6), performed six predefined tasks. Each participant performed two trials of the tasks. Before starting the evaluation session, participants were introduced to the system and performed pre-trials.
In the IGF application, the VH moves freely in the environment, and the coordinates of the virtual camera change with the finger movements accordingly. The z-coordinate of the VH is intentionally kept constant so that the VH remains visible for interaction everywhere in the VE. Activation of the entire system is conditioned on the visibility of the fingers' caps; the user is constantly informed by the text "DETECTED" displayed in the upper-centre part of the scene. Similarly, a user is notified both with text and a beep signal (audio) when an interaction is activated, see Fig. 14.

A. Testing Environment and the Experimental Tasks
The 3D environment designed for the evaluation of the technique contains six tables: two on either side and two in the middle. The VH, made of cubes, represents the user's position in the environment. For easy noticing, the endpoint of the scene is marked by a board with the text "Stop". To enhance immersion, the scene is populated with different 3D objects at different positions. By hitting the Enter key, the system restarts with a new trial/task.
Participants were asked to perform the following six tasks in the designed 3D environment. The tasks cover the basic interaction operations, i.e. Selection, DeSelection, Translation, Scaling, Rotation and Navigation. A 3D object (a teapot) is rendered in the middle of the scene for selection and manipulation. In a single trial, selection and deselection are each evaluated five times, rotation three times, and translation, navigation and scaling two times each. Missed or false detections by the system, after posing the specified gestures, were counted as errors. With this setup, the overall accuracy rate for all interaction tasks, as shown in Table I, is 94.3%. Comparatively more errors occurred during translation and navigation, caused by quick movements of the hand or of both fingers at the same time; in such cases, the camera misses the finger-tips. This implies that the system performance could be improved with a higher-quality camera.

B. Learning Effect
The learning effect was assessed from the error occurrence rate. A paired two-sample t-test was used to analyze the difference in means of the two trials. Under the null hypothesis (H0), we assumed that the mean difference (μd) is 0. The null hypothesis was rejected, as there was a significant difference between the outcomes of the Trial-1 (M=92.8, SD=1.7) and Trial-2 (M=96, SD=1.29) conditions; t(5)=-3.23, p=0.0091. The graphs shown in Fig. 15 and Fig. 16 indicate the clear decrease in errors.
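For reference, a paired t-test of this kind can be run as in the sketch below; the per-task accuracy values are placeholders, not the raw data of the study.

```python
from scipy import stats

# Hypothetical per-task accuracies (%) for the six tasks in Trial 1 and Trial 2;
# the real values come from the error counts summarised in Table I.
trial1 = [92.0, 91.5, 93.0, 94.0, 95.0, 91.3]
trial2 = [95.5, 96.2, 97.5, 96.0, 97.0, 93.8]

t_stat, p_value = stats.ttest_rel(trial1, trial2)   # paired two-sample t-test, df = 5
print(f"t({len(trial1) - 1}) = {t_stat:.2f}, p = {p_value:.4f}")
```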

C. Subjective Analysis
To analyze the response of the participants, a questionnaire was presented to the users at the end of the evaluation session. The questionnaire measures three factors: ease of use, fatigue, and suitability of the technique in VEs. The post-assessment questionnaire, shown in Table II, consists of the following statements:

1. As a whole, the system was easy to use.
2. No fatigue, stress or strain was felt during or after performing the interaction tasks.
3. The technique is suitable for VR/AR applications.

The percentage of users' responses to the three factors is shown in Fig. 17.

VI. Comparative Study
Following the interaction standards presented by various researchers [57][58][59], the proposed approach is systematically compared with recent state-of-the-art interaction techniques. Based on the available information (acquired from the literature), a score of '1' is assigned to a technique if a particular interaction rule (Ri) is followed. The final score is calculated as the sum of Ri over the N standard rules of 3D interaction (Eq. 14), where N is the total number of rules listed in Table III:

R1: Interactive support
R2: Reliable sensory feedback
R3: Simple and inexpensive
R4: Useable while sitting
R5: Wireless connection

A total of eight state-of-the-art interaction techniques (including the proposed approach) were selected for the comparative analysis. Details of the techniques, with their possible challenge(s), are presented in Table IV. After extracting information from the research works about the five standards, the final score was computed, see Table V.
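A trivial sketch of the scoring in Eq. (14), assuming each satisfied rule contributes 1 to the sum; the example values are hypothetical.

```python
def final_score(rules_followed):
    """rules_followed: dict mapping R1..R5 to 1 (followed) or 0 (not followed)."""
    return sum(rules_followed.values())   # Eq. (14): sum of Ri over the N rules

# Example with hypothetical values for one technique:
print(final_score({"R1": 1, "R2": 1, "R3": 1, "R4": 1, "R5": 0}))   # -> 4
```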

VII. Applicability in Games
Game technology is shifting from 2D graphics to realistic 3D VR. Recent research works have shown that gesture-based interfaces are suitable for games [67][68]. Free hand and/or finger gestures have promising potential not only in education, training and medical applications [69][70][71] but also in contemporary VR games [67]. It has been shown that the use of gestures improves the overall engagement of gamers [72][73]. To make VR games interesting and intuitive, a number of gesture-based techniques have been proposed. The techniques proposed in [67] and [68] are based on the Leap Motion and Nintendo Wii controllers. However, such systems suffer from the limited workspace of the controllers and from the training of gestures [68].
The proposed technique is applicable in games where the gamer needs to select and/or control the actions of an avatar/character in the game environment. Unlike other gesture-based systems [74], the proposed technique neither keeps feature sets [75][76][77] nor needs searching and classification [78][79][80], hence ensuring responsiveness during gameplay. To properly assess the applicability of the proposed technique in contemporary VR games, we selected three well-liked games; see Fig. 18. Skyfall is a simple game created using the physics engine planck.js [81]. There are three types of balls in the game that fall from the top of the screen in random order. White and green balls are worth +10 points, while red balls are worth -10 points. Players need to earn points by moving a paddle to catch the good balls (white and green); at the same time, a gamer should avoid the bad balls (red). The GIFT technique is applicable to controlling the paddle by the movements of the fingers.
Similarly, Beads Puzzle [82] is a ball-shooter game where a gamer fires one ball at a time. The direction of the cannon is controlled by mouse movements, and a shot is fired by the mouse-down event. The game can be played more easily with the proposed technique than with a mouse.
Unity's Alien Invasion [83] is another popular game where a player has to defend a city from aliens. The player needs to control a character that is able to collect bombs and fire. With the proposed technique, a gamer can more efficiently control the actions of the character. One suitable configuration of the game with the technique is: 1) controlling the movements of the character with the dynamic positions of the fingers, 2) using the pinch gesture for collecting the bombs, and 3) using the cross gesture for firing the bullets. Additional research is required to assess the applicability of the technique in HMD-based VR games.

VIII. Conclusion and Future Work
To cope with the pace of VR developments in various fields, simple and natural interfaces are needed. With this contribution, we proposed a novel gesture-based interaction technique which needs no expensive device other than an ordinary camera and pieces of paper. Simple and intuitive gestures are used to ensure naturalness while performing the basic interactions: navigation, selection, scaling, rotation and translation. Experimental results show that the proposed approach has reliable recognition and accuracy rates. The system neither needs training images nor uses any feature extraction, hence guaranteeing fast pre-processing. The proposed system is feasible and well suited to a wide spectrum of HCI applications, including virtual prototyping, 3D gaming, robotics, industrial architecture and simulation. The work also covers the smooth integration of image processing and VEs and can be used in the design of realistic VR applications.