This paper provides new research perspectives in the field of multimodal comprehension
(the integration of auditory and visual information) by using immersion and incorporating eye tracking in a
virtual reality environment. The objective is to investigate the influence of a change in narrative
perspective (point of view) during the activation of a mental model underlying comprehension
between visual and auditory modalities. Twenty-eight participants, equipped with an
SMI HMD HTC eye-tracking headset (250 Hz), watched 16 visual scenes in virtual reality accompanied by
their corresponding auditory narration. The change in perspective could occur either in the visual
scenes or in the auditory narration. Mean fixation durations on typical objects of the visual scenes (Areas of
Interest) that were related to the perspective shift were analyzed, as well as the free recall of
narratives. We split each scene into three periods according to different parts of the narration
(Before, Target, After); the Target period was where a shift in perspective could occur. Results showed
that when a visual change of perspective occurred, mean fixation duration was shorter
(compared to no change) for both Target and After. However, when an auditory change of
perspective occurred, no difference was found during Target, whereas during After, mean fixation
duration was longer (compared to no change). In the context of 3D video visualization, it seems
that auditory processing prevails over visual processing of verbal information: The visual
change of perspective induces less visual processing of the Areas of Interest (AOIs) included in
the visual scene, but the auditory change in perspective leads to increased visual processing of
the visual scene. Moreover, the analysis showed higher recall of information (verbatim and
paraphrase) when an auditory change in perspective was coupled with no visual change of
perspective. Thus, our results indicate a more effective integration of information when there is
an inconsistency between the narration heard and the scene viewed. A change in perspective, instead of
creating comprehension and integration difficulties, seems to effectively raise attention and
induce a shorter visual inspection. These results are discussed in the context of cross-modal
comprehension.