Sorting the wheat from the chaff on Stack Overflow

By the SMU Corporate Communications team

By analysing question tags and users’ activity across developer platforms, researchers can predict whether posted code is secure more reliably than upvotes and reputation scores allow.

By Alvin Lee

SMU Office of Research & Tech Transfer – Scroll down just a little on the splash page of Stack Overflow (SO) and you will see the simple headline: For developers, by developers. Just below that is a short description of why it is the most popular Q&A website for software developers:

Stack Overflow is an open community for anyone that codes. We help you get answers to your toughest coding questions, share knowledge with your coworkers in private, and find your next dream job.

In essence, users post coding questions – often with code snippets – and other users answer these questions. Good questions and answers get “upvoted”, with each upvote adding 10 points to a user’s reputation, signalling his or her trustworthiness in the community.

But raw numbers do not paint a full picture of an answerer’s expertise.

“As reputation aggregates the user’s answers across topics and/or domains, it is not representative of the user’s expertise in a given domain. Users with a high reputation score might be regarded as trustworthy users when answering a question in a domain outside of their expertise,” says Ming Shan Hee, a third-year student at SMU’s School of Information Systems (SIS).

Together with David Lo, Associate Professor of Information Systems at SMU, and Roy Ka-Wei Lee, Assistant Professor in Computer Science at the University of Saskatchewan in Canada, Hee proposes the use of Contextual Expertise “to evaluate the user’s expertise using his relevant developer activities across developer platforms, namely SO and GitHub (GH)”.

Providing the context

On Stack Overflow, a developer’s relevant activity takes the form of answering questions. Questions on SO are typically labelled with “tags” that classify what a question is about, e.g. “java”, “python”, “javascript” and so on. The tags of the questions a user has answered (the context) can therefore be used to gauge that user’s domain expertise. However, common tags such as these do not suggest any deep understanding of a question’s domain, Hee explains, as opposed to uncommon tags such as “checksum” and “digital-signature” that represent security libraries and concepts.

Hence, “for the assignment of the tags’ weights for Contextual Expertise, we decreased the weight for common tags and increased the weight for uncommon tags,” Hee tells the Office of Research and Tech Transfer.
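To illustrate the idea, here is a minimal sketch of such a weighting. The article does not spell out the exact formula the researchers used, so the inverse-frequency (IDF-style) weight below, the `tag_weights` function and the sample data are all illustrative assumptions, not the study’s actual scheme.

```python
from collections import Counter
import math

# Hypothetical illustration: down-weight common tags and up-weight rare ones
# using an inverse-frequency (IDF-style) weight per tag.
def tag_weights(questions):
    """questions: list of tag lists, one per answered question."""
    n = len(questions)
    counts = Counter(tag for tags in questions for tag in set(tags))
    # Rare tags (e.g. "checksum") get a high weight, common tags ("java") a low one.
    return {tag: math.log(n / count) for tag, count in counts.items()}

weights = tag_weights([
    ["java", "checksum"],
    ["java", "digital-signature"],
    ["java", "python"],
])
# weights["java"] is 0.0 (it appears in every question), while
# weights["checksum"] is log(3) -- the uncommon security tag counts for more.
```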

Additionally, questions are often associated with multiple tags belonging to the same domain. Tags with high co-occurrence usually carry some implicit insight into one another; for example, “hash” is often seen alongside “md5”, with “hash” being the over-arching domain that encompasses “md5”. Tags that are not attached to a given question but co-occur strongly with its tags are termed quasi-tags, Hee explains.

“We expanded our search queries for tags to include the identification and evaluation of the users’ relevant expertise in quasi-tags,” Hee explains.
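A rough sketch of how quasi-tags could be found from co-occurrence counts is shown below. The `quasi_tags` function, the co-occurrence threshold and the toy corpus are assumptions for illustration only; the study’s actual identification procedure is not described in this article.

```python
from collections import Counter
from itertools import combinations

# Illustrative sketch: find tags that frequently co-occur with a question's
# explicit tags but are not listed on the question itself (quasi-tags).
def quasi_tags(question_tags, corpus, min_cooccurrence=2):
    """corpus: list of tag lists from other questions."""
    pair_counts = Counter()
    for tags in corpus:
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1

    related = set()
    for tag in question_tags:
        for (a, b), count in pair_counts.items():
            if count >= min_cooccurrence and tag in (a, b):
                related.add(b if tag == a else a)
    # Quasi-tags are strongly related tags the question does not carry.
    return related - set(question_tags)

# e.g. a question tagged only "md5" picks up "hash" as a quasi-tag,
# because "hash" and "md5" appear together often elsewhere.
print(quasi_tags(["md5"], [["hash", "md5"], ["hash", "md5", "java"], ["hash", "sha1"]]))
```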

Working with a security dataset

Hee used Fischer et al.’s previous research, which identified security vulnerabilities in code snippets posted by Stack Overflow users, as ground truth. Expanding on Fischer et al.’s work, Hee matched 298 SO users – posting a total of 393 questions – to their accounts on GH. Using a Random Forest classifier, Hee sought to find out how accurately his approach could predict whether a code snippet is secure.

“The goal of our research is to provide a trustworthiness score (level of confidence) for each answer so that inexperienced users of Stack Overflow can be aware and more wary of code posted by non-SMEs (Subject Matter Experts),” Hee explains. “Random Forest is a machine learning method; in our study, it is used to compute a confidence probability on how certain it thinks a piece of code is secure or insecure, based on the features provided, e.g. reputation and contextual expertise.

“Generally, we will want Random Forest to give us confidence above 0.5 for secure answers and confidence below 0.5 for insecure answers,” he adds.
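A minimal sketch of this setup, assuming a scikit-learn Random Forest, is given below. The feature values, labels and 0.5 threshold usage are illustrative; this is not the authors’ actual pipeline or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical example: each answer is represented by features such as the
# answerer's reputation and contextual-expertise score; the label records
# whether the attached code snippet was judged secure (1) or insecure (0).
X_train = np.array([
    [1200, 0.82],   # [reputation, contextual expertise] -- made-up values
    [45000, 0.10],
    [300, 0.65],
    [900, 0.05],
])
y_train = np.array([1, 0, 1, 0])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# predict_proba returns the forest's confidence; above 0.5 the answer is
# treated as secure, below 0.5 as insecure, matching the threshold Hee describes.
confidence_secure = clf.predict_proba([[700, 0.70]])[:, 1]
print(confidence_secure)
```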

Results

The results are based on three metrics: precision, recall, and F-measure. The Contextual Expertise approach scored 0.6116 for precision, which means that “out of all the answers that our approach predicted as secure, 61.16 percent of them are actually secure (meaning that some of the answers predicted as secure could actually be insecure),” Hee elaborates. The reputation-based approach, as upvoting now stands on SO, scored 0.4880 for precision.

For recall, which “represents, out of all the secure answers, how many of them are predicted correctly,” Contextual Expertise scored 0.5987 compared to 0.4899 for simple reputation. For F-measure, the harmonic mean of precision and recall, the scores were 0.6007 versus 0.4868, again in favour of the Contextual Expertise approach.
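For readers unfamiliar with these metrics, the short sketch below shows how they are conventionally computed from true positives (TP), false positives (FP) and false negatives (FN). The counts used are hypothetical and are not taken from the study.

```python
# Standard definitions of the three metrics (illustrative, not the study's code).
def precision(tp, fp):
    return tp / (tp + fp)        # of everything predicted secure, how much really is

def recall(tp, fn):
    return tp / (tp + fn)        # of everything truly secure, how much was found

def f_measure(p, r):
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall

# Hypothetical counts: 61 correctly flagged secure answers, 39 false alarms,
# 41 secure answers missed.
p, r = precision(61, 39), recall(61, 41)
print(p, r, f_measure(p, r))
```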

In other words, Contextual Expertise was substantially more reliable in gauging an answerer’s expertise in a specific domain. Hee summarises the research:

“Based on our findings, we can see that reputation, an aggregated measure used in SO, has its limitations in predicting an answerer’s trustworthiness for a particular post. To improve on this task, we are able to utilise and incorporate the answerer’s contextual knowledge, based on his historical answers, to determine the trustworthiness of an answer post.”
