Which Similarity Metric to Use for Software Documents? A study on Information Retrieval based Software Engineering Tasks


Information Retrieval (IR) plays a key role in diverse Software Engineering (SE) tasks. The similarity metric is the core component of an IR technique, and its performance varies across document types. Different SE tasks operate on different document artifacts, such as bug reports, software descriptions, and source code, which often contain non-standard, domain-specific vocabulary. It is therefore important to understand which similarity metrics work best for different SE documents. We analyze the performance of different similarity metrics on various SE documents, including a diverse combination of textual artifacts (e.g., descriptions, readme files), code artifacts (e.g., source code, APIs, import packages), and artifacts that mix text and code (e.g., bug reports). We find that, in general, context-aware IR models achieve better performance on textual artifacts. In contrast, simple keyword-based bag-of-words models perform better on code artifacts.
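
To make the notion of a keyword-based bag-of-words similarity concrete, the following is a minimal sketch and not necessarily one of the metrics evaluated in the study. It assumes cosine similarity over raw term counts; the helper names (`tokenize`, `bow_cosine`) and the sample import-package lists are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bow_cosine(doc_a, doc_b):
    """Cosine similarity between the raw term-count vectors of two documents."""
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))
    # Only terms present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical import-package lists for three projects (illustrative only).
imports_a = "numpy pandas sklearn"
imports_b = "numpy scipy sklearn"
imports_c = "flask jinja2 requests"

score_ab = bow_cosine(imports_a, imports_b)  # two of three packages shared
score_ac = bow_cosine(imports_a, imports_c)  # no packages shared
```

Because the score depends only on exact token overlap, two projects that share package names score highly regardless of word order or surrounding context, which is the property that distinguishes such models from context-aware ones.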

The 40th International Conference on Software Engineering