
Empathy for the Attacker

Written by Viking Sec

It was 2014. I was finally nearing the end of my undergraduate computer science coursework, which meant the oft-dreaded senior project. My university was notorious for softballing the senior project assignments, requiring a very loose academic write-up and oral presentation of a very open-ended project assignment, watched over by a usually unenthusiastic advisor board and equally unenthusiastic graduates-to-be.

This was… hardly a fraught, terse and professional thesis defense, and I’d been looking forward to this project for quite some time. Most of the coursework over the past 4 years had been incredibly mundane and hardly real-world applicable, so I was looking forward to being able to show off my work in a more practical manner.

I’d fallen (literally) one hour short of a degree in Mandarin Chinese, a slight that had left a bitter taste in my mouth considering how much work I’d put into the program. The program itself was more or less designed to put students in that position if they chose not to travel abroad for a year after graduation, but that’s a rant for another time… The point was, I wanted to use some of the hard work I’d put into my language studies to do something cool, something of merit, in my senior project.

So, the Hacking the Great Firewall project was born. My classmates and I had made a fair bit of use of a specific Chinese social media site during my Mandarin coursework, and it seemed a good place to apply my language and development skills. Zhihu is essentially a social media site combined with Yahoo Answers, a place where conversations begin with questions on all sorts of topics, including social issues, computer science, anime series and art. As with all things, I’d noticed a suspicious lack of conversation surrounding certain topics, and the discussion of censorship had come up from time to time throughout my coursework, usually to the chagrin of the professors, who often tried to keep things relatively non-political in the classroom.

My plan was, at first, pretty simple: develop a platform that can scrape, search and spider the Zhihu social media sites, starting from “root” nodes of topics I thought would likely be censored, as well as a sort of “control group” of likely benign, mostly uncensored topics. Root nodes of likely controversial topics included the politics sections, the section on foreign affairs and American politics, the section on social issues, etc. Control group topics were ones about popular Chinese soap operas and sports events and such.

The output of the project: a list of censored topics, such as the high rate of suicides in academic settings in China, political subjects surrounding freedom of speech and foreign affairs topics.

This blog post, though, isn’t about that output… It’s about the time I spent developing the scrapers themselves, and the empathy it gave me toward an unlikely demographic in my current work: foreign hackers and online criminal groups.

An introduction to scraping

I’m currently developing a course on scraper and spider development, and one of the topics I’m covering is common problems in scraper development. Most specifically, certain kinds of scrapers can be incredibly fragile: one simple change to a site can break the scraper.

Web page scrapers are relatively simple: they pull down raw pages of HTML and parse them, looking for the data that interests the developer. If I have a simple web page that just has an H1 tag and an H3 tag that contains, say, stock price data, I make a request to the page and pull out whatever data is in the H3 tag. The problem, then, comes if the developers decide to change the H3 to an H1 tag. If I’m looking for a tag with a certain class name, say “stonkprice”, and the developer changes the class name to “stonkPrice” with a capital letter, my scraper will possibly break since it doesn’t find what it’s looking for.
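That fragility can be shown with a minimal stdlib-only sketch. The page markup, the `stonkprice` class name, and the dollar figure are the hypothetical examples from the paragraph above, not anything from the real project:

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Grabs the text inside the first tag carrying a target class name."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._capture = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if self.result is None and self.target_class in classes:
            self._capture = True

    def handle_data(self, data):
        if self._capture:
            self.result = data.strip()
            self._capture = False

def scrape_price(html, target_class="stonkprice"):
    parser = PriceScraper(target_class)
    parser.feed(html)
    return parser.result

print(scrape_price('<h3 class="stonkprice">$42.10</h3>'))  # $42.10
# One rename on the site side and the scraper silently comes back empty:
print(scrape_price('<h3 class="stonkPrice">$42.10</h3>'))  # None
```

The scraper doesn’t crash when the class name changes; it just stops finding anything, which is arguably worse, since nothing tells you the site moved out from under you.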

This can be exceedingly frustrating, even without the developer of the site changing anything. You have to develop a parser to be more resilient, which means programming error checks and constantly playing whack-a-mole with edge cases. In some circumstances, the site maintainer may change larger site design layouts, changing the web design schema to put certain pages in different places than what they had initially, leading to… more whack-a-mole.
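One common way to play that whack-a-mole defensively is to keep a list of every selector variant you’ve ever seen and fall through them in order. This is a sketch, not the project’s actual code, and the class names and patterns are invented for illustration:

```python
import re

# Every variant of the price element observed so far (hypothetical examples).
# Regex-on-HTML is fine for a sketch; a real scraper would use a proper parser.
CANDIDATES = [
    re.compile(r'class="stonkprice"[^>]*>([^<]+)'),
    re.compile(r'class="stonkPrice"[^>]*>([^<]+)'),
    re.compile(r'id="price"[^>]*>([^<]+)'),
]

def extract_price(html):
    """Walk the known selector variants in order; return the first hit.
    None means every known variant failed, so log the raw page and
    start a new round of whack-a-mole."""
    for pattern in CANDIDATES:
        match = pattern.search(html)
        if match:
            return match.group(1).strip()
    return None

print(extract_price('<h3 class="stonkPrice">$42.10</h3>'))  # $42.10
print(extract_price('<p>redesigned page, no price here</p>'))  # None
```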

Sometimes, though, you get to deal with the most fun (read: frustrating) site administrators, the ones actively trying to keep you from scraping their site. This is what I dealt with on Zhihu, and they had some interesting ways to make my life more difficult. They created, whether automatically or manually, a series of hurdles as my days scraping the site turned into weeks: CAPTCHAs, random timeouts, user-agent detection and blocking, locking random content behind user registration requirements, IP blocking, etc. I had to spend more and more time building in random timeouts and rotating user-agents, scraping sites that offered free, open web proxies, creating new user accounts that would just get banned in a few days, and finding ways around the always-annoying CAPTCHAs. I could have written an entire paper just on ways to evade scraper detection, but my university advisors were already getting somewhat nervous about the fact that I was majorly pissing off site admins on the other side of the world, so I kept those details relatively scarce.
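The timeout, user-agent, and proxy tricks above can be sketched as a small request planner. Everything in it is illustrative: the user-agent strings are generic, the proxy addresses are RFC 5737 documentation addresses, and the delay bounds are made up, none of it is from the original project:

```python
import random
from itertools import cycle

# Illustrative pool of user-agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
# Placeholder proxies (RFC 5737 test addresses), cycled round-robin.
PROXIES = cycle(["203.0.113.10:8080", "203.0.113.11:8080"])

def next_request_plan(min_delay=2.0, max_delay=9.0):
    """Pick a random inter-request delay, a random user-agent, and the
    next proxy in rotation for the upcoming page fetch."""
    return {
        "delay": random.uniform(min_delay, max_delay),
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxy": next(PROXIES),
    }

plan = next_request_plan()
# A real scraping loop would time.sleep(plan["delay"]) and then issue the
# request with plan["headers"] through plan["proxy"].
print(plan["proxy"])
```

The jitter makes the traffic look less mechanical than a fixed-interval loop, and rotating identities spreads the requests across whatever the site keys its blocking on.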

I had to constantly change the design of my scrapers to evade detection and mitigation. It was constant work, and I was worried that the next “fix” that would come from Zhihu’s site administrators would be the silver bullet that would finally kill my project. I can somewhat proudly say that the silver bullet never came. I would sit up late at night running a multi-threaded scraper on my laptop in my apartment bedroom, watching for errors like a hawk. They came sometimes in the middle of the night (many times during working hours in Beijing) and I would spend the next hours of the morning building fixes and documenting the newest detection and mitigation methods.

The paper and presentation were met with a pretty high amount of interest, considering my talk was toward the latter end of the last day of presentations, to a crowd of seniors who just wanted it all to be over with, and advisors who wanted to begin their summer just as badly as the students around them. The feedback was enthusiastic and positive.

Empathy toward the attacker

I think back to those days building out detection evasion methods and mitigation workarounds fondly. It was a blast, somewhat thrilling knowing that I was pissing off people who were stomping all over freedom of speech online. One lesson I would end up learning much later, just last week, honestly, was empathy toward attackers who spend quite a bit of time in shoes very similar to mine.

I was reading over some information security blogs to wrap my head around the enigma that is Russian operations in cyberspace when I had an odd thought:

It must be incredibly frustrating for the attacker when they see a new blog post come out!

I’ve had enough development experience to know that even the most rudimentary malware takes hours and hours to develop. Malware that targets niche machines like ICS/SCADA systems, or even more “basic” malware targeting normal user or enterprise machines, can take months or years to develop, test, and maintain. Operators spend hours upon hours, building and testing detection and mitigation evasion mechanisms, error handling, and a wide array of feature sets in their malware to make it work as intended. They test it on their own infrastructure or against small-fry targets to ensure their tools work and work well. It’s a labor of love to develop anything, especially something as intricate as a highly-functional, modular trojan…

Then comes the cat-and-mouse game with investigators and network administrators. Attackers often peer over their shoulders once they finally land boots-on-network, constantly watching for an overly paranoid network admin or a help desk clerk moonlighting as an infosec professional. At any moment, their screens could go dark, their shells could hang, and the precious data, the honey flowing from their precariously positioned honeycombs, could run dry.

Then, some jerk decides to write a blog, and the whole thing could be burned in an instant!

When those blogs drop, when those Tweets flow about the latest discovery on VirusTotal, they can burn years, certainly months, of often laborious effort on the part of the developers and operators. There is quite a bit of talk from many in the information security space about “imposing cost” on the operators. The cost is often described monetarily, or in terms of wasted operational efforts, or just in making it more difficult to maintain access and exploit breaches. The cost that I hadn’t thought enough about, though, was the emotional cost. I know how difficult it is to spend hours and hours developing something, only for the silver bullet to be launched by site administrators in the form of Cloudflare protection or major site changes. It can be infuriating, especially after playing cat-and-mouse for hours on end, writing updates and bug fixes, constantly checking logs to find streams of errors flashing red at you from the terminal.

It’s obviously forefront in our minds, the constant cat-and-mouse whack-a-mole game that is “blue team” defense of networks and information assets. Many of us, as defenders or threat intelligence experts, face burnout in our careers as the treadmill of almost endless threats to our safety and security buffet us like an endless monsoon of gale-force winds. This week, though, I pondered the endless surge of discoveries, detections and mitigations that plague our enemies on the other side of the aisle, the attackers that spend their early mornings and late nights developing the next new “sexy” or mundane threat to lob at our networks, hoping for at least initial success and some exploitable access before their access is burned and they start anew, with a blank IDE screen and a new target set.

So, if you’re an operator out there and some marketing team just burned your op with a blog or a researcher killed your trojan with a tweet, just know that I understand the turmoil. I’m going to keep working against you, but know that there’s a part of me that truly does empathize with that frustration.


I’m not active on Twitter anymore, for reasons I talk about in this blog post, but you can follow me on my Twitter page for new blog posts and announcements.

If you’re interested in the scraper class I’m developing, you can find out more on my GitHub page that will host some of the code and details about the class or my YouTube channel where I’ll post about updates for the class and the announcement when it’s done.
