Which Similarity Metric to Use for Software Documents? A study on Information Retrieval based Software Engineering Tasks

Abstract

Information Retrieval (IR) plays a key role in diverse Software Engineering (SE) tasks. The similarity metric is the core component of an IR technique, and its performance varies across document types. Different SE tasks operate on different document artifacts, such as bug reports, software descriptions, and source code, that often contain non-standard, domain-specific vocabulary. It is therefore important to understand which similarity metrics work best for different SE documents. We analyze the performance of different similarity metrics on various SE documents, including a diverse combination of textual (e.g., description, readme), code (e.g., source code, API, import package), and mixed text-and-code (e.g., bug reports) artifacts. We find that, in general, context-aware IR models achieve better performance on textual artifacts. In contrast, simple keyword-based bag-of-words models perform better on code artifacts.
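
To make the term "keyword-based bag-of-words model" concrete, the following is a minimal sketch of one such metric: cosine similarity over raw token counts. It is an illustration only and is not taken from the study; the tokenizer and the function names `tokenize` and `cosine_similarity` are assumptions introduced here, and the metrics actually evaluated may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical, deliberately naive tokenizer: lowercase the text and
    # keep runs of letters, digits, and underscores as tokens.
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine_similarity(doc_a, doc_b):
    # Represent each document as a bag of words (token -> count),
    # ignoring word order and context.
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))

    # Dot product over the tokens the two documents share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)

    # Euclidean norms of the two count vectors.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty document has no direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the score depends only on overlapping tokens, two import lists that name the same packages score highly regardless of order, whereas two natural-language descriptions that paraphrase each other with different words score low.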

Publication
The 40th International Conference on Software Engineering