Research — 31 Aug, 2023

The metaverse and generative AI make for a powerful combination

The fundamental idea behind the metaverse is real-time interaction between people and things in a virtual spatial environment. Regardless of whether that environment is purely virtual or an augmented reality, applications and experiences rely on the content within them. Much current online content is based on easily created text and uploaded images, which is not ideal for a spatial virtual world experience, where 3D objects and digital animation replace or work with the physical world.

Building 3D models is not a simple art to master, and generative AI is being turned toward that goal, building on the techniques used for text-to-image generation in OpenAI LLC's DALL-E and Midjourney Inc.'s Midjourney. This could enable any user of a virtual environment to create shareable content as easily as they post social media text and images, but with much more impact. S&P Global's 2023 Worldwide Enterprise Metaverse survey shows the appeal, with a majority of 1,004 industrial, enterprise business-to-business/consumer (B2B/C) respondents saying they already are, or plan to be, using the metaverse in the next 12 months.

The high-quality content required by the metaverse is hard to create, whether it is a digital twin of factory machinery, a company-branded store for virtual goods or a quirky environment for brainstorming. Generative AI and other related AI-powered approaches offer a way to mitigate the skills shortage and the time-consuming nature of 3D modeling. These technologies enable people in an enterprise environment to, for example, move remote meetings away from basic office replicas to unique environments, making them more memorable. The events industry aims to enhance communication through venues and experiences, and generative AI plus the metaverse can put some of that magic into the simplest of online gatherings. Generating marketing copy and reports with large language models is everywhere, but keep an eye out for more creative and expressive online experiences across the board.

Context

Digital tools to support our ability to communicate and create have been a staple in the evolution of both computing and the web — from the word processor replacing the typewriter to desktop publishing for images and text, to HTML web pages that enable blogs and social media applications to thrive. These have typically been about direct implementation based on the user's choices for creating content.

Photography has evolved from film to smartphones, putting instant stills and video (the typical 2D formats that suit websites and apps) into the hands of many. The video game industry has developed from 2D interactive experiences (Space Invaders, Pac-Man) into new genres built around navigating 3D spaces. Those experiences rely on niche skills in 3D modeling, digital animation and player interaction, as well as elements of storytelling.

The digital tools used to produce these are an order of magnitude more complex than the basics of cropping a photo or typing a paragraph. Generative AI is being used to produce this content and has the evolving potential to help populate metaverse applications with content created by many more users who would otherwise be unable to do so.

Understanding 3D modeling and animation

Taking a photo is easy, but hand-drawing a high-quality image or diagram is much less so. 3D modeling is even more difficult because it is, effectively, digital sculpture. Starting with a digital block and chipping at it with software tools, or molding and combining parts as might be done with clay, is not a quickly learned craft. An object is also defined by its materials and textures, which are often applied by wrapping a 2D image around the 3D geometry (UV mapping) to make it appear realistic.
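
As a rough illustration of that wrapping step, the minimal sketch below (Python, standard library only; the function name is ours, not from any tool discussed here) maps a 3D point to 2D texture coordinates via a simple spherical projection. Real modeling packages use far more sophisticated UV unwrapping, so treat this as the general idea rather than a production technique.

    import math

    def spherical_uv(x: float, y: float, z: float) -> tuple[float, float]:
        """Map a 3D point (roughly centered at the origin) to (u, v)
        texture coordinates in [0, 1] using spherical projection."""
        r = math.sqrt(x * x + y * y + z * z) or 1.0   # avoid divide-by-zero
        u = 0.5 + math.atan2(z, x) / (2.0 * math.pi)  # longitude drives u
        v = 0.5 - math.asin(y / r) / math.pi          # latitude drives v
        return u, v

    # A vertex on the unit sphere's equator, facing +x, lands mid-texture.
    print(spherical_uv(1.0, 0.0, 0.0))  # (0.5, 0.5)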

The complexity does not end there. Many 3D objects need moving parts or live elements in a simulation. Another specialist art, rigging, defines a skeleton or joint structure in an object and applies accurate relationships within it. For example, a human-shaped avatar's elbow joint has a certain range of motion, and an animation that lifts or moves the hand must bring the elbow with it, using a process called inverse kinematics. Muscle movement and balance must also be modeled.
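
To make inverse kinematics concrete, here is a minimal, hypothetical two-bone solver in Python (the function, lengths and 2D simplification are ours, not taken from any engine). Given a hand target, it derives the shoulder and elbow angles using the law of cosines, so the elbow follows the hand exactly as described above.

    import math

    def two_bone_ik(target_x: float, target_y: float,
                    upper_len: float, lower_len: float) -> tuple[float, float]:
        """Analytic 2D inverse kinematics for a shoulder-elbow-hand chain.
        Returns (shoulder_angle, elbow_bend) in radians that place the hand
        at the target, clamped to what the arm can physically reach."""
        dist = math.hypot(target_x, target_y)
        # Clamp the target distance into the chain's reachable range.
        dist = max(abs(upper_len - lower_len), min(dist, upper_len + lower_len))
        dist = max(dist, 1e-9)  # guard a degenerate zero-length reach
        # Law of cosines gives the interior elbow angle of the arm triangle.
        cos_elbow = (upper_len**2 + lower_len**2 - dist**2) / (2 * upper_len * lower_len)
        elbow_bend = math.pi - math.acos(max(-1.0, min(1.0, cos_elbow)))
        # Shoulder angle: direction to the target minus the triangle offset.
        cos_offset = (upper_len**2 + dist**2 - lower_len**2) / (2 * upper_len * dist)
        shoulder = math.atan2(target_y, target_x) - math.acos(max(-1.0, min(1.0, cos_offset)))
        return shoulder, elbow_bend

    # Reach for a point 0.5 m away with a 0.3 m upper arm and 0.25 m forearm.
    print(two_bone_ik(0.4, 0.3, 0.3, 0.25))

Production rigs solve this in 3D with joint limits and pole vectors, but the relationship between hand target and elbow position is the same.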

This whole process is akin to the filmmaking art of stop-motion animation, which requires extensive model-making and time-consuming positioning. A digital human skeleton might be reused in many figures, but nonhuman shapes like a car may need doors and wheels simulated in the virtual world, and trees may need to look different yet blow realistically in the wind.

Beyond posable 3D models, animation must be applied to make them come alive. A typical movie approach is motion capture, where a digital recording of a person performing an action is mapped into an animation file that can be played back frame by frame. These animations tend to be designed to start and finish in a neutral state so that several can be chained together. The animation a user sees is rendered live, but the underlying motion data is fixed.
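
A minimal sketch of that chaining convention, again in Python with invented pose data: each clip is authored to begin and end at a shared neutral pose, so any sequence of clips can be concatenated without visible pops at the joins.

    from dataclasses import dataclass

    NEUTRAL = {"shoulder": 0.0, "elbow": 0.0}  # shared rest pose, in degrees

    @dataclass
    class Clip:
        """A fixed, pre-recorded animation: one joint pose per frame."""
        name: str
        frames: list[dict[str, float]]

    def chain(clips: list[Clip]) -> list[dict[str, float]]:
        """Concatenate clips into one frame sequence. Because each clip
        starts and ends at NEUTRAL, the joins between clips do not jump."""
        timeline: list[dict[str, float]] = []
        for clip in clips:
            assert clip.frames[0] == NEUTRAL and clip.frames[-1] == NEUTRAL
            timeline.extend(clip.frames)
        return timeline

    wave = Clip("wave", [NEUTRAL, {"shoulder": 80.0, "elbow": 30.0}, NEUTRAL])
    point = Clip("point", [NEUTRAL, {"shoulder": 45.0, "elbow": 5.0}, NEUTRAL])
    print(len(chain([wave, point])))  # 6 frames, safe to play back to back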

Another approach is to define key poses and let the model's constraints, as in the elbow example, resolve the in-between positions. One challenge for any animation is its applicability in different situations (e.g., walking up a slope looks different from walking up stairs, or on sand versus concrete). Are these different animations, or the same one dynamically adjusted? Interactions between objects must also behave sensibly in a game context. Physics engines and content toolkits can assist, but the work remains a craft. This level of detail, applied to anything and everything, means that high-end games can take hundreds of experienced people years to produce. Here, as in many other industries, generative AI should begin to help.
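
To sketch the key-pose approach (the joints and limits below are invented for illustration), the snippet blends two key poses by a parameter t, which a runtime could drive from context such as slope or surface, while clamping each joint to its range of motion so the in-between frames stay physically plausible.

    # Two key poses for a simple arm rig, as joint angles in degrees.
    KEY_POSE_A = {"shoulder": 10.0, "elbow": 20.0}
    KEY_POSE_B = {"shoulder": 70.0, "elbow": 150.0}
    # Physical range of motion per joint (an elbow cannot hyperextend).
    JOINT_LIMITS = {"shoulder": (-40.0, 90.0), "elbow": (0.0, 145.0)}

    def interpolate_pose(t: float) -> dict[str, float]:
        """Blend between the key poses (t in [0, 1]), clamping each joint
        to its allowed range as the rig's constraints would."""
        pose = {}
        for joint, a in KEY_POSE_A.items():
            b = KEY_POSE_B[joint]
            angle = a + t * (b - a)                # linear in-betweening
            lo, hi = JOINT_LIMITS[joint]
            pose[joint] = max(lo, min(hi, angle))  # respect joint limits
        return pose

    for t in (0.0, 0.5, 1.0):
        print(t, interpolate_pose(t))  # at t=1.0 the elbow clamps to 145

Production engines interpolate rotations as quaternions and blend many clips at once, but the principle of constrained in-betweening is the same.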

Now consider a business user building a presentation deck for a client meeting. What can they easily add to the slides? Text is the primary editable content, followed by images like photos or graphs, but even images become complicated. Finding just the right one in a corporate image bank can take significant time and effort. We are already seeing presentations that credit generative AI applications like DALL-E and Midjourney, which allow the person creating the deck to describe in a short text prompt the sort of image they want. After a few iterations, the user will have an appropriate high-quality, freshly generated image to drop into the deck.

Search engines are being adapted not only to find existing images but also to generate them from the search criteria using generative AI. This sort of image generation has become mainstream in the space of 12 months. Talented designers and creators with a vision can still produce better images than basic prompts do, but that gap between expert and novice applies to the quality of all forms of content.

Tools and libraries do exist to help find 3D content and put it together. However, the kind of casual business use seen in building a presentation deck will not come to the metaverse, where 3D content remains the custom craft behind multibillion-dollar games, until creating it becomes as easy as today's generative AI image generation.

More than models

The creation of interactable visual assets is a major use case for generative AI, but other applications of this wave of artificial intelligence are relevant as well. These include:

  • How does a nonplayer character (NPC) react to a new situation? What do they say or do if it has not been pre-scripted?
  • What sound does an object make as it moves or operates?
  • How can new realistic combinations of situations be presented to an autonomous robot training in a virtual environment?
  • How can a training application adjust its difficulty to sit just above a student's ability in order to maximize impact?
  • When moving an object from one environment to another, what needs to change to make it contextually appropriate?
  • Does the AI need to run on the cloud with its latency challenge, or can some of it be brought closer to the edge (as with manufacturing ML) or even into the object itself?
  • Which parts of an experience or world are working well, and which are not needed or can be improved?

Examples of metaverse generative AI

One of the companies heavily engaged in AI and the metaverse is NVIDIA Corp. It offers applications to turn 2D images into 3D assets to help in virtual training simulations. It also generates new objects based on existing ones, but different enough to provide variety. Its Avatar Cloud Engine can help drive AI conversation responses. Inworld is another company working on AI-driven NPCs. Its developer toolkit can define an in-game character's personality, motivations and knowledge, and then ask a large language model such as ChatGPT or Google LLC's Bard to behave in that manner. Niantic Inc. created its forest-owl demo, Wol, using this toolkit.

Startup 3Dfy Ltd. enables users to generate a fully textured 3D object from a text prompt. It first asks for the type of object, such as a chair or table, to narrow the generation, then applies the text prompt to that. The object can be downloaded in a variety of formats, with the tool adjusting the model to suit each one.

While not technically 3D, Runway AI Inc.'s Runway.ml provides generative AI text-to-video and image-to-video generation online, bringing motion into consideration in any context, not just real-world video capture. Cross-platform avatar tool Ready Player Me has a beta of AI-generated clothing textures from text prompts. Roblox Corp. is also exploring the use of generative AI tools in its cross-platform user-generated content development environment.

Unity Software Inc. has its Muse generative AI tooling in development to bring object, animation and code generation to a wider developer audience. With Sentis, the company is enabling edge-first AI for developers, reducing cloud reliance at runtime. Its AI Hub helps developers find the right approaches from areas such as generative AI, AI/machine learning integration and behavioral AI.

Conclusion

In 451 Research's "Metaverse: State of Play, Trends and Trajectory" report, we describe the wider place of the metaverse, including the role of AI, in industrial, enterprise and social environments; in our Generative AI Digest, we address generative AI. The ability to use a simple text description to ask for a shared online 3D environment populated with believable or interesting things is an advance to watch for, beyond all the other generative AI use cases.

This article was published by S&P Global Market Intelligence and not by S&P Global Ratings, which is a separately managed division of S&P Global.
451 Research is part of S&P Global Market Intelligence. For more about 451 Research, please contact 451ClientServices@spglobal.com.
