PrefixMap

Neural Machine Translation (NMT) is the current trend approach in Natural Language Processing (NLP) to solve the problem of automatically inferring the content of target language given the source language. The ability of NMT is to learn deep knowledge inside lan- guages by deep learning approaches. However, prior works show that NMT has its own drawbacks in NLP and in some research problems of Software Engineering (SE). In this work, we provide a hypothesis that SE corpus has inherent characteristics that NMT will confront challenges compared to the state-of-the-art translation engine based on Statistical Machine Translation. We introduce a problem which is significant in SE and has characteristics that challenges the abil- ity of NMT to learn correct sequences, called Prefix Mapping. We implement and optimize the original SMT and NMT to mitigate those challenges. By the evaluation, we show that SMT outperforms NMT for this research problem, which provides potential directions to optimize the current NMT engines for specific classes of parallel corpus. By achieving the accuracy from 65% to 90% for code tokens generation of 1000 Github code corpus, we show the potential of using MT for code completion at token level.

The code and data of PrefixMap are available at here.