EVEN-VE: Eyes Visibility Based Egocentric Navigation for Virtual Environments



I. Introduction
Navigation is the user's locomotion inside a virtual environment (VE) to reach an object of interest and inspect it from close proximity. The two broad metaphors of navigation are Egocentric and Exocentric. Where the former is analogous to moving the world in reference to the user's input, the latter moves the user's viewpoint inside the virtual world. Navigation is often a preliminary step to other interactions such as selection, scaling and rotation. Furthermore, a widespread 3D scene cannot be viewed from a single static pose of the virtual camera. Navigation is, therefore, important for exploring different parts of a synthetic world. Although a number of navigation techniques based on mouse/keyboard have been proposed, their traditional menu-command interfaces fail to bring intuition and naturalism to a VE. Low-cost vision-based tracking, on the other hand, provides a suitable platform for devising a flexible interface for Human Computer Interaction (HCI). The many gesture-based techniques in the 3D interaction literature provide a considerable degree of realism. The latency and fatigue involved in posing gestures, however, are challenges yet to be addressed. In particular, the time taken to pose a gesture adds to the bandwidth imbalance in the human-computer dialogue, which in turn badly affects immersion.
As eyes-based interaction is faster than speech- and gesture-based interaction [1], interaction via eye postures can reduce this speed mismatch. This paper presents a convenient and cost-effective interaction based on eyes' wavering. The technique, Eyes Visibility based Egocentric Navigation in Virtual Environment (EVEN-VE), follows gestures of the eyes, which are speedier than gestures made by any other part of the body. Although worthy research has been carried out on iris recognition, the iris varies from person to person and therefore cannot be used for interaction [2]. The proposed system considers only the wavering of the eyes, hence no processing time is spent on tracing user-specific features. Only flickering of the eyes and tilting of the head are traced by the system to interact with the designed 3D virtual world. Navigation is activated by a slight flickering gesture of both eyes, while panning is performed by tilting the head along the look vector. To make the user aware of the beginning of any action, all interactions are initiated when both eyes are open. While the eyes are closed, threshold times for forward and/or backward navigation are checked. Keeping both eyes closed for longer than a normal flicker activates forward navigation. Holding the eyes closed for another equal interval activates reverse navigation. Panning is performed based on the eyes' positions in the dynamically scanned frames. As the eyes' positions change with tilting of the head, the 2DoF head movements, Rolling and Pitching, are traced to pan along the z-axis or x-axis respectively. The case-study project EWI was designed using open libraries to implement and evaluate the system. Based on the algorithm of the technique, EWI supports multiple speed sectors for navigation. The system was evaluated in two separate sessions. The first session evaluated the accuracy and learning effect of the system.
In the second evaluation session, the system's performance was assessed in different navigation sectors using two different cameras. Results of the second session indicate that both the speed and the accuracy of the system can be improved with a high-quality camera. The proposed technique is applicable to all the basic navigation tasks: exploration, searching and inspection [3]. Moreover, because little muscular effort is involved in wavering of the eyes and tilting of the head, less fatigue is expected.
The paper is organized into seven sections. Related work on navigational interaction is discussed in Section II. Section III elaborates the algorithm of the technique in detail, and Section IV covers implementation and evaluation. In Section V, the system is compared with state-of-the-art navigation techniques. Section VI discusses the applicability of the technique in VR. Conclusions and future work are discussed in the last section.

II. Literature Review
The ceaseless progress in the processing and storage capabilities of computer systems has made possible the emergence of virtual reality applications in every field. Convenient dialogue with such applications requires realistic ways of interaction. Being the most commonly used interaction task, navigation is the focus of much current research aimed at making exploration feasible and flexible. Most navigation techniques in the virtual reality literature are either based on mouse/keyboard or need a Head-Mounted Display (HMD) [4]. Where the former lack immersion, HMD-based techniques remain a second choice because of their high cost. The bimanual multi-finger gestural navigation of [5] works on a constrained tabletop surface and is not suited to hand-held devices. Another hand-gesture-based navigation technique, NuNav3D [6], first needs whole-body pose estimation. This makes the system not only difficult to use but also 79% slower than Joypad-based navigation [6]. The technique of Lee et al. [7] consumes a large amount of processing for finger action recognition, as the system has to pass through three heavy stages: skin-color detection, k-cosine based angle detection and contour analysis, for the finger's state, position and direction respectively. Furthermore, because finger thickness varies, the accuracy of the system varies from user to user.
With Drag'n Go [8], the position of the screen cursor casts a ray to the target for navigation. The technique is applicable only to navigation in large empty spaces, and oblique navigation is not supported. In another related approach [9], dragging the mouse in a specific direction shifts the look-at vector of the virtual camera. For a lengthy travel, the action therefore has to be repeated again and again. Moreover, if the ray collides with an object on the way, navigation is halted and the collided object is mistakenly inspected.
In head-directed navigation [10], speed and direction are calculated from the pose and direction of the user's head. Although the intuitiveness of the system is commendable, unintentional and casual head movements are wrongly treated as navigation commands. FmF (Follow my Finger) [11] presents the MC (Management Cabin) model for navigation. The system projects a 3D view of a virtual world on a 2D tabletop device but suffers from a disorientation problem.
The tracker-based gesture recognition systems of [12] and [13] are noteworthy navigation techniques that work on natural hand gestures. Both systems treat the virtual scene as a big object that can be grabbed and moved with the hands. The techniques are interesting but remain a rare option because of the hand-worn equipment they require.
The walking-in-place technique of [14] correlates the pace of natural walking with navigation speed in the synthetic environment. The approach cannot be used while sitting on a chair. Chest expansion and tilting of the spine are treated as navigation commands in [15]. The system needs a complex setup of sensors that is implementable only inside a controlled environment. Similarly, the foot-based navigation interface suggested in [16] requires a cumbersome setup of waist-based magnetic trackers with a conveyor belt.
Fiducial markers have also been used effectively for navigation. In the DeskCube system [17], different markers are glued on different faces of a cube. The system works well when the cube is still but fails to cope with occlusion and blurring while the cube is moving. Similarly, the navigation system based on QR codes requires the navigation device to be placed close to the marker [18].
With the Leap Motion based system [19], the user needs to activate navigation commands via buttons within a limited space. The finger-based locomotion technique FWIP (Finger Walk In Place) [20] requires the fingers to touch the display screen for navigation. The system is applicable only to mobile-based VEs. The pointing technique of [21] uses two fingers, one for viewing and the other for direction. Although it successfully avoids the mistakes of gaze-directed navigation, speed control is its main challenge.
As eye-tracking based interfaces ensure quick and effortless interaction [22], eyes-based interaction is becoming prominent in the domain of VR. The system of [23] traces gaze direction after a button is pushed to reach an object. The system needs an HTC Vive with a Tobii eye tracker and is therefore applicable only inside a lab environment. In the EyeScout system [24], eye trackers are mounted on a rail system. The trackers are kept aligned with the user's lateral movement by a computational method. Based on a 9-point polynomial algorithm, the HoloLens-based system of [25] maps eye position to a plane in which calibration points are displayed. The system needs a constrained setup containing a wireless camera, an eye tracker and an IR (infrared) filter. The lazyNav system [26] supports other interactions besides navigation using an HMD and trackers. Extensive research has been carried out to enhance the efficiency of eye tracking. The algorithm proposed by [27] for measuring focal distance in an immersive CAVE (Cave Automated Virtual Environment) aims to improve the precision of eye tracking. Moreover, as eye-tracking based interaction is expected to be VR's next frontier [28], models for accurate detection of eyes using next-generation devices have also been suggested [29].

III. EVEN-VE: The Proposed System
Interaction via eye gestures is becoming an effective alternative owing to the availability of portable head-mounted eye trackers. Nevertheless, the clumsy setup of wires, as shown in Fig. 1, restricts the use of such systems to the premises of a lab. The aim of this research was to introduce a cheaper and easy-to-use interaction technique. The proposed system tracks the eyes' positions and orientations through an ordinary camera without the use of any trackers or bunch graphing [30]. An eye closure slightly longer than a normal flicker is taken as the command to start forward navigation. If navigation is already activated, the same gesture halts it. Backward navigation is activated by keeping both eyes closed for a longer time than for forward navigation. After both eyes are detected, the Mid-Point (MP) is calculated to get the central point between the eyes [31]. MP is calculated from the horizontal and vertical positions of the eyes as,

$$MP = \left( \frac{LEye_x + REye_x}{2},\ \frac{LEye_y + REye_y}{2} \right)$$

where XEye are point variables representing the mid position of the eyes (eyeball with sclera), and X stands for both Left (eye) and Right (eye).
A straight virtual ray starting from MP is taken to represent the user's direction. The ray's hit position is represented by a pointer in the designed 3D world, see Fig. 2. Before navigation starts, the pointer can be used to set the position of the virtual camera. By default, navigation is performed straight along the look vector. Using the 2D positions of the eyes in the initial five frames, the Eyes Initial Average Area (EIAA) and the Normal Speed Zone (NSZ) are defined. NSZ is a limited area in which navigation is performed at normal speed. EIAA is calculated from X_AA, the area of each eye over the initial five frames, where the area of an eye, as shown in Fig. 3, is calculated from the length and width of the eye (eyeball including iris and sclera). Navigation speed can be increased or decreased based on the areas of the eyes. Moving the head forward along the z-axis increases the areas of the eyes, which in turn makes navigation faster, and vice versa.
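The definitions of eye area, X_AA and EIAA given above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the EWI implementation: the area of an eye is assumed to be the product of its length and width, X_AA the mean area over the initial five frames, and EIAA the mean of the left and right averages. The function names and the sample measurements are hypothetical.

```python
from statistics import mean


def eye_area(length, width):
    """Area of one eye region in pixels (assumption: length times width)."""
    return length * width


def initial_average_area(frames):
    """X_AA for one eye: mean area over the initial frames.

    frames is a list of (length, width) pairs, one per scanned frame.
    """
    return mean(eye_area(length, width) for length, width in frames)


def eiaa(left_frames, right_frames):
    """Eyes Initial Average Area (assumed form: mean of the two per-eye averages)."""
    return (initial_average_area(left_frames) + initial_average_area(right_frames)) / 2


# Hypothetical measurements (in pixels) for the first five frames.
left = [(30, 12), (31, 12), (30, 13), (29, 12), (30, 12)]
right = [(30, 12), (30, 12), (31, 12), (30, 13), (30, 12)]
print("EIAA:", eiaa(left, right))
```

The resulting value serves as the reference area against which later, dynamically measured areas are compared.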
Panning is performed by gentle tilts of the head towards the left or right about the z-axis (Rolling). A slight bend towards the left activates left panning, whereas right panning is accomplished by bending the head towards the right. Up panning (flying) and down panning (landing) are performed by the Pitching gesture, raising and bowing the head along the y-axis respectively. With both Rolling and Pitching gestures of the head, the positions of the eyes change. By calculating the dynamic positions of the eyes against MP, horizontal or vertical panning is performed.

A. The System Architecture
The system designed for the implementation starts with a virtual world at the front end containing different 3D objects. At the back end, a region of interest (ROI) based on skin color is extracted from the scanned image. Once the eyes are detected, MP is calculated and coordinate mapping is performed to set the virtual camera position in the environment. As long as the eyes are detected, the virtual pointer moves freely with the movement of the eyes. If the eye-closing time exceeds the normal blinking duration, navigation is activated. Tilting of the head, which alters the eyes' vertical positions relative to the central MP, activates panning. A schematic of the proposed system is shown in Fig. 4.

B. Face Segmentation
In order to reduce processing cost and to avoid the possibility of false detection, a ROI image (ROI-Image) is first extracted. ROI-Image is expected to contain both eyes. This segmentation is carried out on the basis of skin color. As thresholding in the YCbCr space provides the best skin color segmentation [32], the YCbCr model is used for face extraction. Each scanned RGB frame is thus converted to YCbCr space (Frame_YCbCr) using equation (6). From the binary image of Frame_YCbCr, see Fig. 5, ROI-Image is extracted with rows m and columns n using our designed algorithm [33], where Lm, Rm and Dm represent the Left-most, Right-most and Down-most skin pixels.
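The segmentation step can be illustrated with a short Python sketch. It is not equation (6) or the algorithm of [33]: it assumes the full-range ITU-R BT.601 (JPEG-style) RGB to YCbCr conversion and a commonly quoted skin range for Cb and Cr, both of which are stand-ins chosen for illustration. The function names and the tiny test frame are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion for 8-bit RGB (assumed variant)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def is_skin(r, g, b, cb_range=(77, 127), cr_range=(133, 173)):
    """Threshold on Cb and Cr only; the ranges are assumed, not from the paper."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]


def skin_bounds(frame):
    """Return (Lm, Rm, Dm): left-most and right-most skin columns and the
    down-most skin row, or None when no skin pixel is found.

    frame is a list of rows, each row a list of (r, g, b) tuples.
    """
    cols, rows = [], []
    for y, row in enumerate(frame):
        for x, pixel in enumerate(row):
            if is_skin(*pixel):
                cols.append(x)
                rows.append(y)
    if not cols:
        return None
    return min(cols), max(cols), max(rows)


# Hypothetical 3x4 frame: 'S' is a skin-like tone, 'B' a blue background.
S, B = (220, 170, 140), (0, 0, 255)
frame = [[B, B, B, B],
         [B, S, S, B],
         [B, S, B, B]]
print(skin_bounds(frame))
```

Restricting the later eye search to the region bounded by Lm, Rm and Dm is what reduces the processing cost mentioned above.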


C. Eyes Detection
The Haar cascade classifier for eyes [34], chosen for its accuracy, is used for the detection and tracking of the eyes. The algorithm detects eyes rapidly in live video and is more accurate than LBP [35]. With a Haar cascade, the system is first trained on a number of positive and negative images, and rectangular features are then selected by AdaBoost. A detection window is moved over the scanned image and, for each subsection, the Haar-like features are calculated. The greater the match with the positive features, the higher the confidence of eye recognition.

D. Coordinates Mapping
A notable challenge in implementing the technique was to locate and move the pointer of the virtual environment from the pixel position of MP. The image frame of OpenCV starts from the top left (0,0), while in OpenGL (0,0) lies at the center of the clipping area. To harmonize the dissimilar coordinate systems, we devised four mapping functions m1, m2, m3 and m4 [36], as described in equations (8), (9), (10) and (11). The image frame is virtually split into four regions R1 to R4 as shown in Fig. 6, where the mapping for a region Rn is made by the corresponding mn, taking x and y of a pixel of Rn as independent variables.
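The change of origin described above can be sketched as a single affine map. This is an illustrative stand-in, not the four functions m1 to m4 of equations (8) to (11): it assumes a clipping area spanning [-half_w, half_w] by [-half_h, half_h] with the y-axis pointing up, and the name pixel_to_gl and its parameters are hypothetical.

```python
def pixel_to_gl(x, y, width, height, half_w=1.0, half_h=1.0):
    """Map an image pixel (origin top-left, y downward) to a centered
    coordinate system (origin at the center, y upward).

    width and height are the image dimensions in pixels; half_w and half_h
    are the assumed half-extents of the clipping area.
    """
    gx = (x - width / 2) / (width / 2) * half_w
    gy = (height / 2 - y) / (height / 2) * half_h
    return gx, gy


# Corners and center of a hypothetical 640x480 frame.
print(pixel_to_gl(0, 0, 640, 480))      # top-left pixel
print(pixel_to_gl(320, 240, 640, 480))  # image center
print(pixel_to_gl(640, 480, 640, 480))  # bottom-right corner
```

The sign of each output coordinate identifies which of the four quadrants of the frame the pixel lies in, which is the information a per-region mapping would otherwise encode.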

E. Navigation
Navigation is the movement of a virtual camera in a 3D VE towards or away from a look-at point. In the proposed system, navigation is activated/deactivated by a single prolonged blink of the eyes. A normal blink lasts approximately one third of a second, during which the eyes remain closed for about 300-350 milliseconds. To avoid detecting normal unintentional blinking, closures of the eyes shorter than 400 milliseconds are ignored. When the navigation state is enabled, the user is informed by a beep, and actual movement starts when the eyes are opened. Keeping the eyes closed after the first beep activates reverse navigation. Backward navigation is distinguished by a lengthy beep sound. Textual information about each state is also displayed at the top of the virtual environment. By default, navigation is performed in a straight line, but the user can change the direction of navigation dynamically through eye movement. The viewable scene is divided into 180 degrees. Moving the pointer thirty pixels towards the right results in a decrease of thirty degrees, while moving it thirty pixels towards the left results in an increase of thirty degrees. The 30-pixels to 30-degrees ratio can be decreased to divide the scene into further navigation paths. The virtual division of the scene based on the position of the pointer (a 3D wire cone) is shown in Fig. 7.
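The blink-duration logic and the pixel-to-degree steering described above can be sketched as follows. Only the 400-millisecond cut-off and the one-pixel-per-degree ratio come from the text. The reverse threshold of 800 milliseconds (one further equal interval), the choice of 90 degrees as the straight-ahead direction, the rule that any prolonged closure halts an active navigation, and all names are assumptions made for illustration.

```python
BLINK_IGNORE_MS = 400              # from the text: shorter closures are ignored
REVERSE_MS = 2 * BLINK_IGNORE_MS   # assumption: "another equal interval"


def next_state(state, closed_ms):
    """Return the navigation state after the eyes reopen.

    state is one of 'idle', 'forward' or 'reverse'; closed_ms is how long
    both eyes stayed closed, in milliseconds.
    """
    if closed_ms < BLINK_IGNORE_MS:
        return state       # normal blink, no command
    if state != "idle":
        return "idle"      # assumption: a prolonged closure halts navigation
    if closed_ms >= REVERSE_MS:
        return "reverse"
    return "forward"


def heading_degrees(pointer_x, center_x, px_per_degree=1.0):
    """Direction of travel in [0, 180], with 90 assumed to be straight ahead.

    Moving the pointer right (larger x) decreases the angle; moving it left
    increases the angle.
    """
    angle = 90.0 - (pointer_x - center_x) / px_per_degree
    return max(0.0, min(180.0, angle))


state = "idle"
for closure in (320, 500, 450, 900):   # hypothetical closure durations in ms
    state = next_state(state, closure)
    print(closure, "->", state)
print(heading_degrees(350, 320))        # pointer thirty pixels right of center
```

Lowering px_per_degree below one pixel per degree is not meaningful here; raising it makes steering coarser, whereas the text notes that the ratio can be tuned to obtain more navigation paths.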

Navigation Speed
One of the main problems with VE navigation is speed control, the absence of which may lead either to disorientation or to slow velocity [37]. Based on the quality of the camera and the distance of the eyes from it, various finite speed sectors 'Sn' can be defined. Initially, the system was implemented and tested with the default three sectors: NSZ, one ahead of it and one behind it, see Fig. 8 and Fig. 9. Entrance into a particular sector Sn is determined from the Eyes Average Dynamic Area (EADA). Dynamic Areas (DA) of the eyes are calculated on the fly. With forward movement of the head ahead of NSZ, EADA increases and hence speed is increased. Similarly, moving the head behind NSZ decreases EADA, which results in a decrease in speed. The 10-pixel margin in area is to avoid erroneous or unintentional increases or decreases. The pseudo-code for detecting variation in EADA and setting navigation speed accordingly is as follows,

EIAA=EADA
When the condition for forward or backward head movement is met, the current area of the eyes (EADA) becomes the previous area (EIAA); see the second statement of the condition in the pseudo-code. With this, the algorithm of the system supports 'n' sectors beyond and behind the NSZ, provided that the eyes are detectable in a sector Sn | n ∈ {1, 2, 3, ...}. Eye recognition and the extraction of the eyes' area depend on the quality of the camera and the distance of the user from the camera, see Fig. 10.
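The sector update described above can be sketched in Python. Only the 10-pixel margin and the assignment EIAA = EADA on a sector change are taken from the text. The integer sector numbering (0 for NSZ, positive ahead, negative behind), the sector limits and the function name are assumptions made for illustration.

```python
AREA_MARGIN = 10  # from the text: margin that filters unintentional changes


def update_sector(sector, eiaa, eada, min_sector=-1, max_sector=1):
    """Return (new_sector, new_reference_area).

    sector: current speed sector, 0 being NSZ (assumed numbering).
    eiaa:   reference (previous) average eye area.
    eada:   current Eyes Average Dynamic Area.
    """
    if eada > eiaa + AREA_MARGIN and sector < max_sector:
        return sector + 1, eada   # head moved forward: faster, EIAA = EADA
    if eada < eiaa - AREA_MARGIN and sector > min_sector:
        return sector - 1, eada   # head moved back: slower, EIAA = EADA
    return sector, eiaa           # within the margin or at a limit: no change


# Hypothetical sequence of measured areas, starting in NSZ.
sector, reference = 0, 367.0
for area in (372.0, 380.0, 385.0, 360.0):
    sector, reference = update_sector(sector, reference, area)
    print(area, "-> sector", sector, "reference", reference)
```

Because the reference area is replaced on every sector change, each further step is measured relative to the previous one, which is what allows more than three sectors when the camera can still detect the eyes.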

F. Panning
Panning is the translation of the camera eye and the look-at point horizontally or vertically. To avoid the possibility of disorientation, panning is enabled only inside NSZ. Furthermore, panning is performed at a constant moderate speed, independent of navigation speed. Horizontal panning is performed by slightly bending the head about the z-axis (Rolling gesture of the head). With the rolling gesture, the y-axis position of one eye increases while that of the other decreases. Similarly, with vertical movement of the head (pitching), the x and y positions of both eyes change. The changes in eye positions are traced dynamically to perform up or down panning by changing the y-coordinate of the virtual camera. All this is measured against the vertical and horizontal position of MP, as shown in Fig. 11, Fig. 12 and Fig. 13.
The pseudo-code for left and right panning is given as,

Fig. 11. Eyes tracking and setting of MP between the eyes.
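The panning decision described in this subsection can be sketched in Python. The sketch is an illustration under stated assumptions rather than the original pseudo-code: eye positions are compared with a reference MP captured when tracking starts, image y-coordinates grow downward, a hypothetical threshold of 5 pixels filters small movements, and the mapping of eye displacement to "left" or "right" depends on camera mirroring and is assumed here.

```python
def pan_command(left_eye, right_eye, ref_mp, in_nsz=True, threshold=5):
    """Return 'left', 'right', 'up', 'down' or 'none'.

    left_eye, right_eye and ref_mp are (x, y) pixel positions; ref_mp is the
    reference Mid-Point (assumed to be fixed at the start of tracking).
    """
    if not in_nsz:
        return "none"                    # from the text: panning only inside NSZ
    dl = left_eye[1] - ref_mp[1]         # vertical offset of the left eye
    dr = right_eye[1] - ref_mp[1]        # vertical offset of the right eye
    if dl > threshold and dr < -threshold:
        return "left"                    # rolling: one eye drops, the other rises
    if dr > threshold and dl < -threshold:
        return "right"
    if dl < -threshold and dr < -threshold:
        return "up"                      # pitching: both eyes rise in the image
    if dl > threshold and dr > threshold:
        return "down"
    return "none"


mp = (130, 200)                                   # hypothetical reference MP
print(pan_command((100, 210), (160, 190), mp))    # rolling gesture
print(pan_command((100, 185), (160, 186), mp))    # head raised
print(pan_command((100, 202), (160, 199), mp))    # small jitter
```

The returned command would then translate the virtual camera and the look-at point at the constant panning speed.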

IV. System Implementation and Evaluation
The proposed system was implemented in a Visual Studio project, EWI (Eyes' Wavering based Interaction). The front-end virtual scene of EWI was designed in OpenGL. Activation of the system is conditional on the visibility of both eyes. The user is constantly informed about activation of the system by the text "Activated" displayed in the upper part of the scene. Similarly, the user is informed about left and right panning by the text "Turning" at the respective side of the scene, as shown in Fig. 14. The system was tested by sixteen male participants. Each participant performed two trials for each of the four predefined tasks.

A. Testing Environment
The 3D environment designed for the evaluation contained four routes, as shown in Fig. 13, where a robot made of cubes represents the user's position in the environment. The finishing point of the scene is marked by a board with the text "Stop". Each of the routes leads to the one Stop-board. To immerse users and to give a perception of navigation and panning, the scene contained different 3D objects at different positions. Route-4: Up-Straight-Down pathway giving a flying effect to navigate over the bandstand. Although this route is not rendered in the scene, to avoid complication, the user can still use it. A 2D model of all the routes is shown in Fig. 15.

B. Evaluation Procedure
Each participant was briefed about the environment and the interactions. They were guided by a demonstration of how to properly perform navigation and panning. Before the actual trials, each participant performed pre-trials for both navigation and panning. For better detection of the eyes, those wearing spectacles (two of the participants) were requested to remove their glasses. Users were guided to press the 'r' key to restart a current incomplete task or to perform a new task. At the completion of a trial, the Enter key was to be pressed to reset the entire EWI system for a new participant. False detections and wrong turns were counted as errors. All the trials were performed inside the university IT lab in moderate lighting conditions.

C. Evaluation Tasks
Participants were asked to perform the following four interaction tasks in the designed 3D environment.
Task-1: Touching the Stop-board using Route-1 and then returning to the starting point by the same route.
Task-4: Touching the Stop-board using Route-4 and then returning to the starting point by the same route.

D. Results of the First Evaluation Session
Navigation-In is the forward and Navigation-Out the backward movement inside the VE. Left/Right panning is turning, with a shift of the virtual camera towards the respective direction. Task-1 tests Navigation-In and Navigation-Out. Task-2 and Task-3 evaluate Navigation-In, Right panning and Left panning. Task-4 assesses Up panning, Down panning and Navigation. Missed or false detections by the system after the required gestures were posed were counted as errors. With this setup, the overall accuracy rate for all 448 interactions, as shown in Table I, was 91% with an average Standard Deviation (SD) of 0.5. The mean percentage accuracy of the interactions, with standard deviation, is shown in Fig. 16. This implies that the system's performance can be further improved with a high-efficiency camera.

E. Learning Effect
The learning effect was measured from the error occurrence rate. A paired two-sample t-test was used to analyze the difference in means of the two trials. As the null hypothesis (H0), we assumed that the mean difference (μd) is 0. The hypothesis was rejected as there was a significant difference between the outcomes of the Trial-5 (M=34.5, 21.3) and Trial-6 (35.5, 21.5) conditions; (t(5)=-3.8, p=0.0117). The graph indicating this marked decrease in error is shown in Fig. 17.
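For readers unfamiliar with the paired t-test used above, the statistic can be computed as in the following Python sketch. The error counts are hypothetical values invented purely to show the calculation; they are not the data of this study, and the function name is likewise illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the pairwise differences.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two samples of equal length, at least 2 pairs")
    d = [a - b for a, b in zip(first, second)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1


# Hypothetical error counts for six pairs (NOT the study data).
trial_a = [4, 3, 5, 2, 4, 3]
trial_b = [2, 2, 3, 1, 3, 1]
t, df = paired_t(trial_a, trial_b)
print(f"t({df}) = {t:.2f}")
```

The null hypothesis of zero mean difference is rejected when the absolute value of t exceeds the critical value of the t-distribution for the given degrees of freedom.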

F. Assessment of Navigation Speed
In a separate evaluation session, the performance of the system was assessed for different navigation speeds in different sectors. Twelve participants, mean age 29 (SD=4.9, Range=17), familiar with image-based HCI, were invited to the session. Two cameras of different efficiency, an HP Webcam and a Logitech USB webcam, were used in the evaluation process. Participants were guided to perform the following tasks in a single trial using Route-1.
Task-5: Touch the Stop-board as quickly as possible.
Task-6: Get back to the starting point as slowly as possible.
The evaluation session was arranged in two phases with a short break in the middle. During the first phase, the ordinary camera was used for tracking, while in the second phase the high-quality Logitech external camera was used. To properly evaluate each individual trial, the code of the system was updated to display the number of the Highest Achieved Sector (HAS) and to stop detection when a trial completed. HAS is the highest sector reached for navigation in either direction and is obtained by incrementing the previous sector by 1. With the high-quality camera, navigation in the second phase was mostly performed in higher sectors (increase in HAS). Hence, the average completion time obtained was reduced by 45%, see Fig. 19.

Fig. 19. Average of errors occurred, task completion time and HAS of the second phase trials.
Pearson's correlation (r) was used to measure the relationship between Errors and HAS. A moderate positive correlation was found with the ordinary camera (r=0.39), see Fig. 20. A downhill (negative) correlation was obtained with the high-quality camera (r = -0.5), see Fig. 21. The difference in 'r' values therefore suggests that the error rate could be reduced with the use of a high-quality camera. To avoid disorientation, turns are made at a constant moderate speed independent of navigation speed.
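The correlation coefficient reported above follows the standard Pearson formula, which can be sketched in Python as below. The two small samples are hypothetical and serve only to show the sign convention; they are not the Errors and HAS values measured in the session, and the function name is illustrative.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


# Hypothetical values (NOT the study data): errors rising, then falling, with HAS.
has = [1, 2, 3, 4]
print(pearson_r(has, [2, 4, 6, 8]))   # errors increase with HAS: positive r
print(pearson_r(has, [8, 6, 4, 2]))   # errors decrease with HAS: negative r
```

A positive r means that errors tend to grow as higher sectors are reached, while a negative r means that errors tend to fall, which is the pattern reported for the high-quality camera.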

G. Subjective Analysis
A questionnaire was presented to each user at the end of the evaluation session to measure four factors: Fatigue, Naturalism, Suitability in VEs and Ease of Use. The post-assessment questionnaire is shown in Table II.
3 The technique is suitable for VR/AR applications.
4 As a whole, the system was easy to use.
The percentages of users' responses to the four factors are shown in Fig. 22.

V. Comparative Study
To systematically compare the approach with other techniques, we follow the standards presented by [38] and [39] for navigation in VE. Based on the information obtained from the literature, we assign a score ('1') to a technique (1-9) if a standard (S) is obeyed. The final score is calculated using equation (15), where N is the total number of standard rules for navigation, see Table III.

Simple and inexpensive
S4 Support unrestricted flying
S5 Wireless/no connection

Including the proposed technique, a total of nine state-of-the-art gesture-based navigation techniques were selected for comparison. Details of the techniques, including speed control, dynamic turning and possible challenges, are presented in Table IV. After extracting information about the five standards from the research works, the final score was computed, see Table V.

VI. Applicability of the Technique in VR
[46] for navigation. Such systems restrict users from performing other 3D interactions with one hand. Keeping both hands free during navigation guarantees natural interaction [46]. In keeping with this basic rule [38], the goal of this research was to keep the user's hands free during navigation. In this way, users have the option to perform other interactions (selection, scaling and rotation) with any gesture. As the technique reserves no hand gesture for navigation, it can easily be combined with any gesture-based VR system. Furthermore, as no computation for pupil or iris analysis is performed, the technique is suitable for current VR applications where speed matters. Results of the subjective evaluation also suggest that the technique is suitable for VR applications, as 75% of the users agreed with this statement.

VII. Conclusion and Future Work
Navigation is required in many 3D interactive virtual spaces. The emergence of virtual and augmented reality applications in different fields necessitates natural and simple ways of navigation. With this contribution we propose a novel navigation technique which needs no extra device other than an ordinary camera. The intuitive eye gestures used for interaction not only ensure naturalism but also make the technique equally suitable for users with motor difficulties. Experimental results show that the proposed approach has reliable recognition and accuracy rates. Unlike other gesture-based techniques [47], only simple blinking of the eyes is traced, hence occlusion problems are minimized. Moreover, as the system remains indifferent to eyeglasses, accuracy does not suffer from the size and color of eyeglasses. As the results suggest, errors due to false recognition could possibly be reduced by using a high-quality camera.
The proposed system is applicable across a wide spectrum of HCI, including CAD, engineering, 3D gaming and simulation. In the emerging domain of the Internet of Things (IoT), the proposed system could be enhanced to interact with virtual objects [48] and for surgical navigation [49]. The work also covers the smooth integration of image processing and virtual environments, which can lead to the design of more sophisticated virtual and augmented reality applications. This research work is part of our aim of making interaction possible outside virtual reality labs. Although we have succeeded for navigation, rotation and scaling are yet to be covered. In future, we are determined to enhance the system for rotation and scaling as well.