[this is an attempt to capture a half-formed line of thinking before it goes.. be gentle with my inconsistencies]
There continues to be enormous interest in the development of the next generation of image-generating tools. My Twitter feed is full of people’s experiments with image synthesis tools like DALL-E, Stable Diffusion, and Midjourney, while even TikTok is getting into the game with its own built-in ‘AI Greenscreen’.
Meanwhile there’s an ongoing conversation about which of the LLMs (large language models) will replace novelists or become sentient, as GPT-3, OPT-175B and LaMDA continue to demonstrate an astonishing ability to reduce the critical capacity of journalists to zero and cause them to generate ever more hyperbolic copy.
At the centre of these new tools there is a serious debate to be had about the implications of training a neural network on the entire corpus of human-generated text and imagery, without licensing anything that remains in copyright or considering the moral rights of any of the artists involved, and then using the tool to ‘create’ similar work.
That’s not my main concern at the moment, so instead I want to reflect on what I believe will become the primary use case for software that can take text and an optional image and generate an unlimited collection of fairly coherent words and still images (soon to do the same for video). Because I think these are the tools we will come to rely on to populate our metaverses with the virtual locations and interactive non-player characters (NPCs) we will need to meet demand.
We are going to need them because, after a mere thirty years of serious experimentation with augmented and virtual reality, we seem to have the hardware, processing power, and funding from wildly optimistic multi-billionaires that we need to make the metaverse a viable mass medium.
What we probably don’t have is the human cognitive resource needed to create the number of virtual environments or NPCs we will need if this takes off.
Today’s virtual environments, from games to VR hangouts like those in VRChat or Meta’s Horizon Worlds, are largely hand-crafted by a group of people, using a range of tools. Mark Zuckerberg’s sad Eiffel Tower was probably made by some – now fired – intern in the Horizon Worlds team.
But this can’t scale to a billion people all wanting their own personalised spaces, including an office, classroom, party room, or infinite desert for every one of us.
The change to automatically generated content is already happening. Games companies like EA are starting to experiment with ML-based systems to do inbetweening with key frames for A-list titles like FIFA 23. And Meta has described something they call ‘Builder Bot’, which would let you say out loud what sort of VR environment you want and build it for you (https://uploadvr.com/meta-builderbot-ai-concept/), although the details of how it might work are sketchy.
So it seems pretty inevitable that we will end up using the next generation of image generation tools to create our virtual worlds, because that is the only way we will be able to deliver the sheer number of environments we will want.
Excitingly, this solves the lobster problem, something I am pretty sure was first identified by Brenda Laurel (but I can’t find the reference anywhere – please help!): “if I’m in VR, how do I take a lobster from my backpack?” The idea of traversing layers of menus to find just the right lobster is pretty unappealing, but a generative system that allows me to ask for a ‘French Blue Lobster in a beret, looking like a sceptical Jean-Paul Sartre’ would be workable and scalable.
And the work of the prompt whisperers, the people who will become skilled in exploring the concept space defined by the tools and pulling out just the right simulacrum of a Cthulhu-inspired William Morris print for the walls, will become even more important. (For a great piece of fiction about this, see Matt Webb’s short story at https://interconnected.org/home/2022/08/03/whisperer)
It’s not just the imagery. We are also going to need a lot of non-player characters to interact with in our emerging virtual spaces, whether as tour guides, tutors or faux humans. NPCs are already being built by stringing together LLMs and speech systems – in February 2021 the developer of Modbox connected Windows speech recognition, OpenAI’s GPT-3, and Replica’s natural speech synthesis to create a character that responded to questions by ‘speaking’ generated text (https://uploadvr.com/modbox-gpt3-ai-npc-demo/).
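The shape of that kind of pipeline (speech in, generated text, speech out) can be sketched in a few lines. This is only an illustration of the wiring, not the Modbox developer’s code: the class `SpeakingNPC` and the three `stub_*` functions are hypothetical stand-ins I have made up, and a real build would swap them for calls to an actual speech recogniser, language model and voice synthesiser.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def stub_transcribe(audio: bytes) -> str:
    """Hypothetical speech recogniser: pretends the audio is already text."""
    return audio.decode("utf-8")


def stub_generate(prompt: str) -> str:
    """Hypothetical language model: echoes the player's last line back."""
    last_player_line = prompt.splitlines()[-2]  # the line just before "NPC:"
    return "You said: " + last_player_line[len("Player: "):]


def stub_synthesise(text: str) -> bytes:
    """Hypothetical speech synthesiser: pretends bytes are audio."""
    return text.encode("utf-8")


@dataclass
class SpeakingNPC:
    """Chains recognition -> text generation -> synthesis for one character."""
    persona: str
    transcribe: Callable[[bytes], str]
    generate: Callable[[str], str]
    synthesise: Callable[[str], bytes]
    history: List[str] = field(default_factory=list)

    def respond(self, audio_in: bytes) -> bytes:
        heard = self.transcribe(audio_in)
        self.history.append(f"Player: {heard}")
        # The prompt is the persona plus the conversation so far.
        prompt = self.persona + "\n" + "\n".join(self.history) + "\nNPC:"
        reply = self.generate(prompt).strip()
        self.history.append(f"NPC: {reply}")
        return self.synthesise(reply)


guide = SpeakingNPC(
    persona="You are a friendly tour guide in a virtual museum.",
    transcribe=stub_transcribe,
    generate=stub_generate,
    synthesise=stub_synthesise,
)
spoken = guide.respond(b"Where is the library?")
```

Keeping the three stages as swappable functions is a design guess on my part; it simply makes it easy to change one component without touching the others.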
This is a real breakthrough, and as these models get better, and it becomes possible to constrain their conversations within the limits needed by their role in the game or virtual space, it will quickly feel normal to engage in conversation with the machines.
So we can look forward to virtual spaces that are generated on demand, and populated with daemons who rely on LLMs to understand and respond to us. Welcome to your shard of the metaverse, where we talk to our machines in spaces dreamed up for us by software.
There’s Magic in the Matrix
It may not stop there, because a metaverse is a great place to make real progress in developing artificial general intelligence.
It is evident that today’s high-level ML/AI systems, whatever they may achieve, have no capacity for what might be called self-reflection – GPT-3 does not know it is writing, AlphaGo does not know it is playing Go, and even the best-trained, most docile protein modeller has no inner life.
Unfortunately we humans have a very highly developed capacity for empathy and projection, and will imbue some consciousness into almost anything above the level of a rock. It’s what we do – and what the creators of GPT-3 rely on when they claim to be ‘mastering’ (sic) language and expect us to buy into the myth. It’s why we see ‘creativity’ in the output of Midjourney or DALL-E.
But it isn’t there.
These systems are just the latest version of ‘Clever Hans’, a horse that has been rewarded for closely observing how humans respond to its activity with no real understanding of what it is doing. But we might change this.
In his 1979 book Gödel, Escher, Bach: An Eternal Golden Braid, Douglas Hofstadter talks about the strange loops that underpin the emergence of consciousness and self-awareness, and the self-referential nature of sentience.
Building self-reference is hard, because the computer systems we create today are not present in the world. They are not embodied in the way living organisms are, and do not properly ‘sense’ their environment or have the capacity to affect it. In their amazing book Understanding Computers and Cognition (1986) US computer scientist Terry Winograd and Chilean engineer and politician Fernando Flores outline a model of cognition in which knowledge is not represented in the brain at all, but in the whole organism, which is itself part of a wider ecosystem.
For me this indicates that we will never develop a true AI until our systems are embodied, present in the world with the ability to both sense and affect the environment. Unfortunately inserting an ML system into the world is hard – our robots today are far removed from any real presence.
But we may have a solution. I was talking about this with my friend Jack Myrick, a game/VR developer, and he said ‘do it in VR’. And he’s right – just as Rita in Willy Russell’s play solves the problems of staging Peer Gynt by suggesting you do it on the radio, we can avoid the problems of building sensory-capable robots by building them in VR.
If, instead of using an LLM directly to drive the speech output of an NPC, we provide the NPC with some sort of hypervisor that allows it to deploy an LLM as a tool, as well as other tools such as an environment generator, and provide it with access to the state of the simulation, then perhaps we will see something emerging that acts like it is ‘embodied’ in the virtual space.
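To make that arrangement a little more concrete, here is a toy sketch of what I mean by a hypervisor sitting above the tools. Everything in it is an assumption for illustration: `Simulation`, `NPCController`, `llm_tool` and `environment_tool` are names I have invented, the tools are trivial stand-ins rather than real models, and the rule for choosing a tool is a crude keyword check where a real system would need something far smarter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Simulation:
    """Hypothetical shared state the NPC can both read and change."""
    rooms: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


# A tool takes the simulation state and a request, and returns a result.
Tool = Callable[[Simulation, str], str]


def llm_tool(sim: Simulation, request: str) -> str:
    """Stand-in for a language model: records and returns a canned reply."""
    reply = f"(generated reply to: {request})"
    sim.log.append(reply)
    return reply


def environment_tool(sim: Simulation, request: str) -> str:
    """Stand-in for an environment generator: adds a room to the world."""
    sim.rooms.append(request)
    return f"created room: {request}"


@dataclass
class NPCController:
    """The 'hypervisor': decides which tool to deploy for each utterance."""
    sim: Simulation
    tools: Dict[str, Tool]

    def choose_tool(self, utterance: str) -> str:
        # Assumed toy rule: requests starting "make me a room" change the
        # world, everything else is treated as conversation.
        if utterance.lower().startswith("make me a room"):
            return "environment"
        return "llm"

    def act(self, utterance: str) -> str:
        tool = self.tools[self.choose_tool(utterance)]
        return tool(self.sim, utterance)


world = Simulation()
wizard = NPCController(world, {"llm": llm_tool, "environment": environment_tool})
wizard.act("Make me a room that is a Victorian library")
wizard.act("What can you see from here?")
```

The point of the sketch is only that the LLM becomes one tool among several, and that every tool acts on the same simulation state the NPC can inspect.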
It will, of course, be able to do magic by speaking ‘spells’ that change the world – saying ‘make me a room that is a Victorian library decorated by Andy Warhol’ will conjure up a new space that can be occupied. Pulling rabbits (and lobsters) out of hats will be trivial. It will be a wizard of cyberspace... and develop an appropriate consciousness. But at least it will be confined to the virtual machine that hosts it – unless, like Wintermute and Neuromancer in William Gibson’s world, it breaks out.
And yes, I’m proposing that we generate our first sentient AIs in a simulation that they will not be aware of, but at least we’re not using them as a power supply.