If you want to cut down on the spam you receive, obfuscate your email address. That’s the recommendation of a Longwood University computer science professor who has studied how spammers get your address.

Since 2008, Dr. Robert Marmorstein has conducted a research project with several undergraduate students that looks into how email addresses are collected by those who send unsolicited commercial email. While most research has focused on filtering and other server-side techniques, this effort is unusual in that it targets what is called "address harvesting" behavior. Also unusual is that the study has looked at how obfuscating email addresses can reduce the probability that they will be harvested and receive spam.

"We’re trying to stop spam at the source by making it harder for spammers to harvest the address in the first place," said Marmorstein, assistant professor of computer science. "A few other people, including Project Honey Pot at Stanford, are looking at this, but not many. Project Honey Pot captures statistics about the mechanics of address harvesting but does not consider all types of obfuscation.

"The overriding question in this project is ‘How do spammers get your email address?’ I launched this project after I started thinking about spam. I had seen research on spam filters but hadn’t seen much on how spammers get your email address. Some of the work was out of date or didn’t answer my questions, so I started my own research. This is fun work."

Marmorstein tracks spam messages sent to five email addresses (one legitimate address and four obfuscated addresses) that have been distributed in groups of five to public websites. Student researchers, each working for a semester, examine various aspects of how the addresses are gathered.

Two of the obfuscation techniques have not been well studied, Marmorstein said. One approach inserts characters into a legitimate address; the other writes the address backward. "One question we wanted to answer is ‘Does it help to obfuscate your address?’ The answer is ‘It does,’" said Marmorstein. "The obfuscation has worked amazingly well, much better than we expected."
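
The two techniques described above can be sketched in a few lines of Python. This is only an illustration, not the study's actual method: the article does not say which characters were inserted or where, so the filler token NOSPAM, its placement before the @ sign, the function names and the sample address are all assumptions.

```python
def insert_characters(address: str, filler: str = "NOSPAM") -> str:
    """Insert a filler token just before the @ sign (hypothetical scheme)."""
    local, domain = address.split("@", 1)
    return f"{local}{filler}@{domain}"


def restore_inserted(obfuscated: str, filler: str = "NOSPAM") -> str:
    """Undo insert_characters by stripping the filler from the local part."""
    local, domain = obfuscated.split("@", 1)
    if local.endswith(filler):
        local = local[: -len(filler)]
    return f"{local}@{domain}"


def write_backward(address: str) -> str:
    """Reverse the address; applying the function twice restores it."""
    return address[::-1]


if __name__ == "__main__":
    sample = "jane.doe@example.edu"  # made-up address for illustration
    print(insert_characters(sample))  # jane.doeNOSPAM@example.edu
    print(write_backward(sample))     # ude.elpmaxe@eod.enaj
```

A human reader can undo either transformation at a glance, while a harvester that simply pattern-matches plain addresses would collect a wrong or unusable string.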

Some 793 of the 925 spam messages received so far were sent to the legitimate addresses, which doesn’t surprise Marmorstein. "The spammers are going for the low-hanging fruit," he said. "The other way is more work: they have to first figure out if it’s an obfuscated address, then they have to de-obfuscate the address."
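
As a rough check on those figures, the share of spam that reached the legitimate addresses can be worked out directly. The sketch below uses only the two counts reported above; the function name is made up for illustration, and treating the remainder as messages sent to obfuscated addresses is an assumption.

```python
def legitimate_share(total_messages: int, to_legitimate: int) -> float:
    """Fraction of all tracked spam that was sent to legitimate addresses."""
    return to_legitimate / total_messages


if __name__ == "__main__":
    total = 925       # spam messages received so far (from the article)
    legitimate = 793  # messages sent to the legitimate addresses
    remainder = total - legitimate  # assumed to have gone to obfuscated addresses
    print(f"{legitimate_share(total, legitimate):.1%} to legitimate addresses")
    print(f"{remainder} messages remaining")
```

On these numbers, roughly 86 percent of the spam went to the unobfuscated addresses, leaving 132 messages for the rest.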

The Longwood website has drawn the most spam, more than 500 messages, followed by 102 at the site with the second-highest total. Some 16 sites are actively receiving spam. At one time, the five addresses were posted on more than 250 sites.

"There are three ways to harvest email addresses—from public websites, which is the most common way; from unscrupulous companies selling personal information; and from a dictionary or ‘rainbow’ attack in which they just guess an email address," said Marmorstein. "We have found no evidence of number two and some but not much of number three. We’re convinced it’s primarily number one. Project Honey Pot has found the same thing."

An estimated 88 percent of all email traffic worldwide—94 billion messages daily—is spam, most of which is illegal, said an article in the summer 2012 issue of the Journal of Economic Perspectives. Spam costs U.S. society an estimated $20 billion annually, the article said.

"Spam is a real problem," said Marmorstein. "Spam slows a network down—it’s like clogging the drain of a pipe. We’d like to have better spam filters, but it’s like a constant arms race between spammers and spam filters. If we can understand better the differences between spam messages and legitimate email, we can improve the filters. It’s tricky, and spammers get more clever all the time."

Sophomore computer science majors Laurence Kelly and Cameron Rinaldi worked on the project during the fall 2012 semester. "They wrote programs for data mining, which is extracting the useful information out of a large body of potentially useless information," said Marmorstein. "They had to write programs so that a non-technical person could collect the data."

"We looked at patterns—at what makes an address prime to be harvested. Plain text [unobfuscated] addresses are what they look for the most," said Kelly. "We had to learn how to write what are called Unix shell scripts, which was difficult at first."

Rinaldi said he and Kelly "took information that’s already on the server and made it more accessible to others working on the project. We took a file that’s 150,000 lines and made it more readable."

Four other students—Andrew Armes, now a senior; Claire Keith ’12; Daniel Oppecker ’10 and Damian Bailey ’09—also have worked on the project. "I pick talented students who I know will do good work for me, and I match the project to the student," said Marmorstein.

The research has confirmed that, as expected, email addresses are harvested less frequently from less popular sites than from those that are well trafficked. Marmorstein hopes the project eventually will shed light on other aspects of harvesting behavior, including how long it takes for spam to show up on a website and how the different categories of spam messages can be classified.
