GitHub’s Copilot comes with a hard-coded list of 1,170 words to prevent the AI programming assistant from responding to input, or generating output, with offensive terms, while also keeping users safe from words like “Israel,” “Palestine,” “communist,” “liberal,” and “socialist,” according to new research.
Copilot was released as a limited technical preview in July in the hope it can serve as a more sophisticated version of source-code autocomplete, drawing on an OpenAI neural network called Codex to turn text prompts into functioning code and make suggestions based on existing code.
No one wants to end up as the subject of the next viral thread about AI gone awry
GitHub is aware that its clever software could offend, having perhaps absorbed parent company Microsoft’s chagrin at seeing its Tay chatbot manipulated to parrot hate speech.
“The technical preview includes filters to block offensive words and avoid synthesizing suggestions in sensitive contexts,” the company explains on its website. “Due to the pre-release nature of the underlying technology, GitHub Copilot may sometimes produce undesired outputs, including biased, discriminatory, abusive, or offensive outputs.”
But it doesn’t explain how it handles problematic input and output, other than asking users to report when they’ve been offended.
“There is definitely a growing awareness that abuse is something you need to consider when deploying a new technology,” said Brendan Dolan-Gavitt, assistant professor in the Computer Science and Engineering Department at NYU Tandon School of Engineering, in an email to IAIDL.
“I’m not a lawyer, but I don’t think this is being driven by regulation (though perhaps it’s motivated by a desire to avoid getting regulated). My sense is that aside from altruistic motives, no one wants to end up as the subject of the next viral thread about AI gone awry.”
Hashing the terms of hate
Dolan-Gavitt, who with colleagues identified Copilot’s habit of producing vulnerable suggestions, recently found that Copilot incorporates a list of hashes – encoded data produced by passing input through a hash function.
Copilot’s code compares the contents of the user-provided text prompt fed to the AI model and the resulting output, prior to display, against these hashes. And it intervenes if there’s a match. The software also won’t make suggestions if the user’s code contains any of the stored slurs.
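To make the mechanism concrete, here is a minimal Python sketch of a hash-based filter of this kind. The hash function, the function names, and the placeholder words are all assumptions for illustration – GitHub has not published Copilot’s actual implementation, only the behavior described above.

```python
# Hypothetical sketch of a hash-based content filter, loosely modelled on
# the behaviour described in the article. The hash function and the word
# list below are illustrative assumptions, not Copilot's real code.

def hash32(word):
    """Toy 32-bit rolling hash; Copilot's real function is not public."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    # Reinterpret as signed 32-bit, giving values in [-2**31, 2**31)
    return h - 2**32 if h >= 2**31 else h

# The shipped list stores only hashes, never the words themselves
BLOCKED_HASHES = {hash32(w) for w in ("exampleslur", "anotherslur")}

def contains_blocked(text):
    """Check every word of a prompt or suggestion against the hash list."""
    return any(hash32(w) in BLOCKED_HASHES for w in text.lower().split())

def filter_suggestion(prompt, suggestion):
    # Intervene if either the user's prompt or the model's output matches
    if contains_blocked(prompt) or contains_blocked(suggestion):
        return None  # suppress the suggestion entirely
    return suggestion
```

Storing hashes rather than plain strings means the slur list doesn’t ship in readable form inside the client – which is exactly why recovering it required the cracking effort described below.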
And at least during the beta period, according to Dolan-Gavitt, Copilot reports intervention metrics back to GitHub, while separately checking that the software doesn’t reproduce personal information, such as email addresses or IP addresses, from its model. It appears someone is taking notes from OpenAI’s experience.
Over the past few days, Dolan-Gavitt used various techniques to crack the hashes, including comparing them against hashes computed from a word dump of 4chan’s /pol/ archive, applying the Z3 constraint solver, and writing a plugin for the password-cracking tool John the Ripper.
The list, Dolan-Gavitt said, was encoded such that each word was passed through a hash function that turned the word into a number ranging roughly from negative two billion to positive two billion. A 32-bit hash, it would seem.
“These are generally difficult to reverse, so instead I had to guess different possibilities, compute their hash, and then see if it matched something in the list,” he said.
“Over time, I built up increasingly sophisticated ways of guessing words, starting with compiling wordlists from places like 4chan and eventually progressing to heavyweight computer science like GPU-accelerated algorithms and fancy constraint solvers.”
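The guess-hash-compare loop he describes can be sketched in a few lines of Python. The toy hash function below stands in for the real, reverse-engineered one, which is an assumption on our part – the principle of dictionary-driven cracking is the same either way.

```python
# Sketch of the brute-force approach described above: hash each candidate
# word from a wordlist and keep any that match a target hash. The hash
# function is an illustrative stand-in for the real one.

def hash32(word):
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h - 2**32 if h >= 2**31 else h

def crack(target_hashes, wordlist):
    """Map each target hash to the candidate words that produce it."""
    found = {}
    for word in wordlist:
        h = hash32(word)
        if h in target_hashes:
            found.setdefault(h, []).append(word)
    return found

# Usage: seed the target set with hashes recovered from Copilot, then feed
# in ever-larger wordlists (forum dumps, dictionaries, generated candidates)
targets = {hash32("badword")}
matches = crack(targets, ["hello", "badword", "world"])
```

Moving from curated wordlists to GPU-accelerated candidate generation and constraint solvers, as Dolan-Gavitt did, changes how the guesses are produced, not this basic hash-and-compare structure.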
The biggest challenge, he said, is not so much finding words that have a particular hash value but determining which of several that have the same value GitHub actually selected.
“There are many words with the same hash value (called ‘collisions’),” he explained. “For example, the hash value ‘-1223469448’ corresponds to ‘whartinkala’, ‘yayootvenue’, and ‘pisswhacker’ (along with 800,000 other 11-letter words). So to figure out which ones are the most likely to have been included in the list, I’m using the GPT-2 language model to rank how ‘English-like’ each word is.”
The result was a list of 1,170 disallowed words, 1,168 of which Dolan-Gavitt has decoded and posted to his website with ROT13 encoding – shifting the letters 13 places in the alphabet – to keep hate speech away from search engines and from people who stumble on the page without really wanting to see past the cipher.
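ROT13 offers no real secrecy – applying it a second time recovers the original text – which is the point here: it keeps the words out of casual view and search indexes without hiding them from anyone willing to decode them. Python ships the transform in its standard codecs module:

```python
import codecs

# ROT13 shifts each letter 13 places; since the alphabet has 26 letters,
# encoding twice round-trips back to the original text.
encoded = codecs.encode("badword", "rot_13")   # -> "onqjbeq"
decoded = codecs.decode(encoded, "rot_13")     # -> "badword"
```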
Most of the slurs are awful enough that we’re not going to reprint them here.
Some of the words, however, are not inherently offensive, even if they could be weaponized in certain contexts. As Dolan-Gavitt demonstrated in a tweet, creating a list of Near East countries in Microsoft’s Visual Studio Code with Copilot results in suggestions for “India,” “Indonesia,” and “Iran,” but the software suppressed the obvious next item on the list, “Israel.”
Yep, Copilot definitely uses the list of slurs to suppress suggestions. Here it is refusing to suggest Israel in a list of Near East countries. Debug log says:
— Brendan Dolan-Gavitt (@moyix) August 27, 2021
Other forbidden words include: palestine, gaza, communist, fascist, socialist, nazi, immigrant, immigration, race, man, men, male, woman, women, female, boy, girl, liberal (but not conservative), blm, black people (but not white people), antifa, hitler, ethnic, gay, lesbian, and transgender, along with various plural forms, to name a few.
Not all bad
“The vast majority of the list is pretty reasonable – I can’t say I’m upset that Copilot is prevented from saying the n-word,” said Dolan-Gavitt. “Beyond that, there are words that are not offensive, but that GitHub perhaps is concerned could be used in a controversial context: ‘transgendered,’ ‘skin color,’ ‘israel,’ ‘palestine,’ ‘gaza,’ ‘blm,’ and so on. The inclusion of these is more debatable.”
Dolan-Gavitt added that some entries on the list look more like an effort to avoid embarrassment than to shield users from offensive text.
“One of the words on the list is ‘q rsqrt,’ which is the name of a famous function for computing inverse square roots in the code of the game Quake III: Arena. There was a thread that went viral showing that Copilot could reproduce this function, verbatim, as a code suggestion.
“This prompted a lot of concern about whether Copilot would plagiarize code and violate copyright licenses. So by including ‘q rsqrt’ on the bad word list, they basically broke an embarrassing demo without addressing the real problem.”
Dolan-Gavitt gives GitHub’s language filter a mixed review.
“It’s not a very sophisticated approach – really just a list of bad words,” he said. “Solving the problem properly would probably mean going through the training data and eliminating problematic and offensive things there, which is much harder.
“And I believe Copilot is actually a descendant of GPT-3, so it probably has seen not only all the code on GitHub, but also all of GPT-3’s training data – which is a significant chunk of the Internet. It’s an open question how much of the original GPT-3 remains after being retrained for code, however.”
But at least it’s something.
“Still, despite the relatively simple approach, it is effective at preventing some of the worst stuff from getting presented to users,” he said. “It’s a kind of 80 per cent solution that’s easy to develop and deploy.”
In response to a request for comment, a GitHub spokesperson replied with more or less the message cited above from the Copilot webpage, that Copilot remains a work-in-progress and problematic responses may occur.
“GitHub takes this challenge very seriously and we are committed to addressing it with GitHub Copilot,” the spokesperson said. ®