Abstract: The present invention provides a system and method for building a lexical analyzer that can scan multibyte character sets. The present invention factors regular expressions that contain multibyte characters so that a single-byte finite state automaton can be constructed. In particular, the present invention provides a computer-based system and method for tokenizing a source program written in a programming language whose characters are represented by both single-byte values and two-byte values. The present invention includes a mechanism for building a lexical analyzer that is configured to accept an input specification, which typically includes one or more regular expressions and their corresponding associated actions. The present invention also includes a mechanism for factoring each regular expression that contains at least one two-byte character into an equivalent regular expression containing only single-byte characters.
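The factoring step described above can be illustrated with a minimal sketch. The function name `factor_to_bytes`, the choice of Shift-JIS as the two-byte encoding, and the `\xNN` escape form are assumptions for illustration only, not the claimed implementation: each character in a regular-expression literal is rewritten as a concatenation of single-byte escapes, so a two-byte character factors into two single-byte characters matched in sequence and a byte-oriented finite state automaton can then be built from the result.

```python
import re


def factor_to_bytes(literal, encoding="shift_jis"):
    """Rewrite each (possibly two-byte) character of a regex literal
    as a concatenation of single-byte \\xNN escapes.

    A two-byte character factors into two single-byte literals matched
    in sequence; a single-byte character factors into one escape.
    (Hypothetical helper for illustration; the encoding is an assumption.)
    """
    parts = []
    for ch in literal:
        encoded = ch.encode(encoding)  # one or two bytes per character
        parts.append("".join("\\x%02x" % b for b in encoded))
    return "".join(parts)


# The factored pattern contains only single-byte elements, so it can be
# compiled against a raw byte stream rather than decoded characters.
pattern = factor_to_bytes("if").encode("ascii")
assert re.fullmatch(pattern, b"if")
```

Because the factored pattern refers only to individual byte values, the resulting automaton has single-byte transitions throughout, which is the property the abstract describes.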