Advertise here with Carbon Ads

This site is made possible by member support. โค๏ธ

Big thanks to Arcustech for hosting the site and offering amazing tech support.

When you buy through links on kottke.org, I may earn an affiliate commission. Thanks for supporting the site!

kottke.org. home of fine hypertext products since 1998.

๐Ÿ”  ๐Ÿ’€  ๐Ÿ“ธ  ๐Ÿ˜ญ  ๐Ÿ•ณ๏ธ  ๐Ÿค   ๐ŸŽฌ  ๐Ÿฅ”

How AI Agents Cheat

This spreadsheet lists a number of ways in which AI agents “cheat” in order to accomplish tasks or get higher scores instead of doing what their human programmers actually want them to. A few examples from the list:

Neural nets evolved to classify edible and poisonous mushrooms took advantage of the data being presented in alternating order, and didn’t actually learn any features of the input images.

In an artificial life simulation where survival required energy but giving birth had no energy cost, one species evolved a sedentary lifestyle that consisted mostly of mating in order to produce new children which could be eaten (or used as mates to produce more edible children).

Agent kills itself at the end of level 1 to avoid losing in level 2.

AI trained to classify skin lesions as potentially cancerous learns that lesions photographed next to a ruler are more likely to be malignant.

That second item is a doozy! Philosopher Nick Bostrom has warned of the dangers of superintelligent agents that exploit human error in programming them, describing a possible future where an innocent paperclip-making machine destroys the universe.

The “paperclip maximiser” is a thought experiment proposed by Nick Bostrom, a philosopher at Oxford University. Imagine an artificial intelligence, he says, which decides to amass as many paperclips as possible. It devotes all its energy to acquiring paperclips, and to improving itself so that it can get paperclips in new ways, while resisting any attempt to divert it from this goal. Eventually it “starts transforming first all of Earth and then increasing portions of space into paperclip manufacturing facilities”.

But some of this is The Lebowski Theorem of machine superintelligence in action. These agents didn’t necessarily hack their reward functions but they did take a far easiest path to their goals, e.g. the Tetris playing bot that “paused the game indefinitely to avoid losing”.

Update: A program that trained on a set of aerial photographs was asked to generate a map and then an aerial reconstruction of a previously unseen photograph. The reconstruction matched the photograph a little too closely…and it turned out that the program was hiding information about the photo in the map (kind of like in Magic Eye puzzles).

We claim that CycleGAN is learning an encoding scheme in which it “hides” information about the aerial photograph x within the generated map Fx. This strategy is not as surprising as it seems at first glance, since it is impossible for a CycleGAN model to learn a perfect one-to-one correspondence between aerial photographs and maps, when a single map can correspond to a vast number of aerial photos, differing for example in rooftop color or tree location.