Abstract: Described herein are systems, devices, and methods for translating an utterance into text for display to a user. The approximate location of one or more potential speakers can be determined and a detected utterance can be assigned to one of the potential speakers based, at least in part, on a temporal relationship between the commencement of lip movement by one of the potential speakers and the reception of the utterance. The utterance can be converted to text and, if necessary, translated from a source language to a destination language. The converted text can then be displayed to the user in an augmented reality environment such that the user can intuitively appreciate to which of the potential speakers the converted text should be attributed.
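The temporal-matching heuristic described above can be sketched as a minimal illustration. All names here (`Speaker`, `attribute_utterance`, the `max_lead` window) are hypothetical and not drawn from the disclosure; this is one simple way to assign an utterance to the speaker whose lip movement commenced closest before the utterance was received, not the claimed implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Speaker:
    # Hypothetical names; not part of the disclosure.
    name: str
    lip_movement_start: Optional[float]  # seconds; None if no lip movement detected

def attribute_utterance(speakers: List[Speaker],
                        utterance_start: float,
                        max_lead: float = 1.0) -> Optional[Speaker]:
    """Assign the utterance to the speaker whose lip movement began
    closest before the utterance was received, within `max_lead` seconds.
    Returns None if no speaker qualifies."""
    best: Optional[Speaker] = None
    best_gap: Optional[float] = None
    for s in speakers:
        if s.lip_movement_start is None:
            continue
        gap = utterance_start - s.lip_movement_start
        # Lip movement must start at or before the utterance, and not too early.
        if 0.0 <= gap <= max_lead and (best_gap is None or gap < best_gap):
            best, best_gap = s, gap
    return best
```

For example, if speaker A's lips begin moving 0.2 s before an utterance received at t = 5.0 s while speaker B's began 2.0 s earlier, the utterance would be attributed to A; the converted (and, if necessary, translated) text could then be rendered near A in the augmented reality view.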