3D Model Generation

3D Generation

3D generation, as the name suggests, is the task of creating a 3D model from a text prompt or a reference image.

As with text2image, it would be nice if a model simply emerged from noise. But just as video adds a time axis to images, 3D generation adds another spatial dimension, so it is not that easy to achieve.

To be upfront: 3D generation has not yet reached a level suitable for professional use. However, the technology for creating models from images, and eventually walkable worlds, is steadily improving.

Let's trace how this still-developing technology has evolved.

Note: The chronological order and technical connections presented here are quite loose. Please read this as a broad overview.

Multi-View Generation

Techniques such as NeRF, which reconstruct 3D scenes or models from images, already existed. However, building 3D with NeRF requires images of the same object taken from many different viewpoints.

Multi-view generation grew out of this requirement.
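To make "images from many viewpoints" concrete, here is a minimal sketch that places cameras on a ring around an object. The function name, the y-up axis convention, and the default radius and elevation are illustrative assumptions, not something taken from NeRF or any specific tool.

```python
import math

def orbit_camera_positions(num_views, radius=2.0, elevation_deg=15.0):
    """Return (x, y, z) camera positions evenly spaced on a ring around the origin.

    Assumed convention (illustrative only): y is up, the object sits at the
    origin, and every camera is meant to look at the origin.
    """
    elevation = math.radians(elevation_deg)
    positions = []
    for i in range(num_views):
        azimuth = 2.0 * math.pi * i / num_views
        x = radius * math.cos(elevation) * math.sin(azimuth)
        y = radius * math.sin(elevation)
        z = radius * math.cos(elevation) * math.cos(azimuth)
        positions.append((x, y, z))
    return positions

views = orbit_camera_positions(num_views=8)
for i, (x, y, z) in enumerate(views):
    print(f"view {i}: x={x:+.2f} y={y:+.2f} z={z:+.2f}")
```

A reconstruction method needs one image per such position, which is exactly what a single input photo cannot provide.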

Zero-1-to-3

Zero-1-to-3: Zero-shot One Image to 3D Object

One of the earliest diffusion-based multi-view generation models. It generates an image from a new viewpoint by changing the camera pose of the input image.
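The "change of camera pose" can be thought of as a relative viewpoint: how far the camera moves in elevation, azimuth, and distance. The sketch below is only an illustration of that idea. The function names and the exact encoding layout are assumptions and are not claimed to match the model's actual implementation.

```python
import math

def relative_viewpoint(src, dst):
    """Relative camera change from src to dst.

    Each camera is (elevation_deg, azimuth_deg, radius) on a sphere around
    the object. The azimuth difference is wrapped into [-180, 180).
    """
    d_elev = dst[0] - src[0]
    d_azim = (dst[1] - src[1] + 180.0) % 360.0 - 180.0
    d_radius = dst[2] - src[2]
    return d_elev, d_azim, d_radius

def encode_viewpoint(d_elev, d_azim, d_radius):
    """Encode the change as a small vector (illustrative layout).

    Using sin/cos for the azimuth keeps the value continuous across the
    0/360 degree boundary.
    """
    a = math.radians(d_azim)
    return [math.radians(d_elev), math.sin(a), math.cos(a), d_radius]

delta = relative_viewpoint(src=(0.0, 350.0, 1.5), dst=(10.0, 20.0, 1.5))
print(delta)
print(encode_viewpoint(*delta))
```

Note how going from azimuth 350 to 20 is treated as a 30-degree turn rather than a 330-degree turn in the other direction.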

At the time, I thought it looked useful even outside of 3D generation, but the hardware requirements were too high for me to run it. Today, similar results are easy to get with instruction-based image editing.

Zero123++

Zero-1-to-3 was used in a loop: generate one image of the input from another angle, rotate the angle, and repeat. Zero123++ instead generates multiple viewpoints simultaneously.

It was already known that when a diffusion model generates multiple images in one batch (cf. Batch & Video), the generated images are somewhat consistent with each other.

3D generation needs images from all directions in the first place. Zero123++ exploits this property and commits to producing multi-views that are as consistent as possible in a single generation.
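When several views come out of one generation, they are often delivered as a single tiled image that has to be cut back into individual views. The sketch below assumes such a tiled layout (3 rows by 2 columns here, purely as an assumption) and uses nested lists as a stand-in for real image data. The function name is hypothetical.

```python
def split_grid(image, rows, cols):
    """Split a 2D image (list of pixel rows) into rows*cols equal tiles.

    Tiles are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid size")
    tile_h, tile_w = height // rows, width // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tile = [row[c * tile_w:(c + 1) * tile_w]
                    for row in image[r * tile_h:(r + 1) * tile_h]]
            tiles.append(tile)
    return tiles

# Toy 6x4 "image" whose pixel value encodes which tile it belongs to.
toy = [[(y // 2) * 2 + (x // 2) for x in range(4)] for y in range(6)]
views = split_grid(toy, rows=3, cols=2)
print(len(views))
```

With a real image library the same slicing would be done on pixel arrays, but the indexing logic is the same.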

Emergence of Video Generation Models

A little later, models capable of video generation began to appear.

At this point, instead of treating multi-view generation as a kind of image editing, a new idea comes up:

A video shot while circling the target object = a very finely sliced multi-view set

Why not treat it that way?
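Under this view, each frame of an orbit video corresponds to a camera azimuth, and a sparse multi-view set is just a subset of frames. The sketch below assumes a constant-speed, full 360-degree orbit starting at azimuth 0. Whether a given model's first or last frame coincides with the input view is not covered here, and the function names are illustrative.

```python
def frame_azimuths(num_frames):
    """Azimuth in degrees assumed for each frame of a constant-speed orbit."""
    return [360.0 * i / num_frames for i in range(num_frames)]

def pick_views(num_frames, num_views):
    """Indices of the frames closest to num_views evenly spaced azimuths."""
    azimuths = frame_azimuths(num_frames)

    def angular_distance(a, b):
        # Shortest distance between two angles on a circle, in degrees.
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    picked = []
    for k in range(num_views):
        target = 360.0 * k / num_views
        best = min(range(num_frames),
                   key=lambda i: angular_distance(azimuths[i], target))
        picked.append(best)
    return picked

# Example: pick 3 evenly spaced views out of a 21-frame orbit.
print(pick_views(num_frames=21, num_views=3))
```

The denser the video, the closer the picked frames land to the requested angles, which is the sense in which a video is a finely sliced multi-view set.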

Stable Video 3D

Introducing Stable Video 3D

It is an image2model pipeline based on Stable Video Diffusion.

SV3D.json

{
  "last_node_id": 17,
  "last_link_id": 21,
  "nodes": [
    {
      "id": 15,
      "type": "CLIPVisionLoader",
      "pos": [
        90,
        350
      ],
      "size": {
        "0": 315,
        "1": 58
      },
      "flags": {},
      "order": 0,
      "mode": 0,
      "outputs": [
        {
          "name": "CLIP_VISION",
          "type": "CLIP_VISION",
          "links": [
            18
          ],
          "shape": 3
        }
      ],
      "properties": {
        "Node name for S&R": "CLIPVisionLoader"
      },
      "widgets_values": [
        "OpenCLIP-ViT-H-14.safetensors"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 10,
      "type": "SV3D_Conditioning",
      "pos": [
        490,
        240
      ],
      "size": {
        "0": 315,
        "1": 170
      },
      "flags": {},
      "order": 5,
      "mode": 0,
      "inputs": [
        {
          "name": "clip_vision",
          "type": "CLIP_VISION",
          "link": 18,
          "slot_index": 0
        },
        {
          "name": "init_image",
          "type": "IMAGE",
          "link": 20,
          "slot_index": 1
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 16,
          "slot_index": 2
        }
      ],
      "outputs": [
        {
          "name": "positive",
          "type": "CONDITIONING",
          "links": [
            11
          ],
          "shape": 3,
          "slot_index": 0
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "links": [
            12
          ],
          "shape": 3,
          "slot_index": 1
        },
        {
          "name": "latent",
          "type": "LATENT",
          "links": [
            13
          ],
          "shape": 3,
          "slot_index": 2
        }
      ],
      "properties": {
        "Node name for S&R": "SV3D_Conditioning"
      },
      "widgets_values": [
        576,
        576,
        21,
        0
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 14,
      "type": "VAELoader",
      "pos": [
        500,
        470
      ],
      "size": {
        "0": 315,
        "1": 58
      },
      "flags": {
        "collapsed": true
      },
      "order": 1,
      "mode": 0,
      "outputs": [
        {
          "name": "VAE",
          "type": "VAE",
          "links": [
            16,
            17
          ],
          "shape": 3,
          "slot_index": 0
        }
      ],
      "properties": {
        "Node name for S&R": "VAELoader"
      },
      "widgets_values": [
        "vae-ft-mse-840000-ema-pruned.safetensors"
      ]
    },
    {
      "id": 4,
      "type": "CheckpointLoaderSimple",
      "pos": [
        90,
        180
      ],
      "size": {
        "0": 315,
        "1": 98
      },
      "flags": {},
      "order": 2,
      "mode": 0,
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "links": [
            19
          ],
          "slot_index": 0
        },
        {
          "name": "CLIP",
          "type": "CLIP",
          "links": [],
          "slot_index": 1
        },
        {
          "name": "VAE",
          "type": "VAE",
          "links": [],
          "slot_index": 2
        }
      ],
      "properties": {
        "Node name for S&R": "CheckpointLoaderSimple"
      },
      "widgets_values": [
        "SV3D\\sv3d_p.safetensors"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 16,
      "type": "LoadImage",
      "pos": [
        83,
        485
      ],
      "size": [
        352.31848818847664,
        437.0448823632812
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            20
          ],
          "shape": 3
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null,
          "shape": 3
        }
      ],
      "properties": {
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "ComfyUI_01605_.png",
        "image"
      ]
    },
    {
      "id": 8,
      "type": "VAEDecode",
      "pos": [
        1200,
        220
      ],
      "size": [
        162.6986083984375,
        46
      ],
      "flags": {},
      "order": 7,
      "mode": 0,
      "inputs": [
        {
          "name": "samples",
          "type": "LATENT",
          "link": 7
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 17
        }
      ],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            21
          ],
          "slot_index": 0
        }
      ],
      "properties": {
        "Node name for S&R": "VAEDecode"
      }
    },
    {
      "id": 17,
      "type": "VHS_VideoCombine",
      "pos": [
        1394,
        225
      ],
      "size": [
        492.2207889102224,
        704.2207889102224
      ],
      "flags": {},
      "order": 8,
      "mode": 0,
      "inputs": [
        {
          "name": "images",
          "type": "IMAGE",
          "link": 21
        },
        {
          "name": "audio",
          "type": "VHS_AUDIO",
          "link": null
        },
        {
          "name": "batch_manager",
          "type": "VHS_BatchManager",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "Filenames",
          "type": "VHS_FILENAMES",
          "links": null,
          "shape": 3
        }
      ],
      "properties": {
        "Node name for S&R": "VHS_VideoCombine"
      },
      "widgets_values": {
        "frame_rate": 8,
        "loop_count": 0,
        "filename_prefix": "AnimateDiff",
        "format": "image/gif",
        "pingpong": false,
        "save_output": false,
        "videopreview": {
          "hidden": false,
          "paused": false,
          "params": {
            "filename": "AnimateDiff_00014.gif",
            "subfolder": "",
            "type": "temp",
            "format": "image/gif"
          }
        }
      }
    },
    {
      "id": 3,
      "type": "KSampler",
      "pos": [
        855,
        220
      ],
      "size": {
        "0": 315,
        "1": 262
      },
      "flags": {},
      "order": 6,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 14
        },
        {
          "name": "positive",
          "type": "CONDITIONING",
          "link": 11
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "link": 12
        },
        {
          "name": "latent_image",
          "type": "LATENT",
          "link": 13
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "links": [
            7
          ],
          "slot_index": 0
        }
      ],
      "properties": {
        "Node name for S&R": "KSampler"
      },
      "widgets_values": [
        12345,
        "fixed",
        20,
        8,
        "dpmpp_2m",
        "karras",
        1
      ]
    },
    {
      "id": 11,
      "type": "VideoTriangleCFGGuidance",
      "pos": [
        526,
        126
      ],
      "size": [
        238,
        58
      ],
      "flags": {},
      "order": 4,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 19
        }
      ],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "links": [
            14
          ],
          "shape": 3,
          "slot_index": 0
        }
      ],
      "properties": {
        "Node name for S&R": "VideoTriangleCFGGuidance"
      },
      "widgets_values": [
        1
      ],
      "color": "#232",
      "bgcolor": "#353"
    }
  ],
  "links": [
    [
      7,
      3,
      0,
      8,
      0,
      "LATENT"
    ],
    [
      11,
      10,
      0,
      3,
      1,
      "CONDITIONING"
    ],
    [
      12,
      10,
      1,
      3,
      2,
      "CONDITIONING"
    ],
    [
      13,
      10,
      2,
      3,
      3,
      "LATENT"
    ],
    [
      14,
      11,
      0,
      3,
      0,
      "MODEL"
    ],
    [
      16,
      14,
      0,
      10,
      2,
      "VAE"
    ],
    [
      17,
      14,
      0,
      8,
      1,
      "VAE"
    ],
    [
      18,
      15,
      0,
      10,
      0,
      "CLIP_VISION"
    ],
    [
      19,
      4,
      0,
      11,
      0,
      "MODEL"
    ],
    [
      20,
      16,
      0,
      10,
      1,
      "IMAGE"
    ],
    [
      21,
      8,
      0,
      17,
      0,
      "IMAGE"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "0246.VERSION": [
      0,
      0,
      4
    ]
  },
  "version": 0.4
}
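A workflow file like the one above is hard to read by eye. As a rough aid, the following sketch lists the node types in execution order and prints each connection in readable form. It assumes the layout visible in the file (nodes carry "id", "type", and "order", and each link is [link_id, from_node, from_slot, to_node, to_slot, type]). The function name and the tiny sample are hypothetical.

```python
import json

def summarize_workflow(workflow):
    """Return node types sorted by 'order', plus readable link descriptions.

    Assumes each link is [link_id, from_node, from_slot, to_node, to_slot, type],
    as seen in the workflow file above.
    """
    nodes_by_id = {node["id"]: node for node in workflow["nodes"]}
    ordered = sorted(workflow["nodes"], key=lambda n: n["order"])
    node_types = [n["type"] for n in ordered]
    connections = []
    for _link_id, src, _src_slot, dst, _dst_slot, kind in workflow["links"]:
        connections.append(
            f'{nodes_by_id[src]["type"]} -> {nodes_by_id[dst]["type"]} ({kind})'
        )
    return node_types, connections

# Tiny stand-in for the real file. With the real one you would load it with:
#   with open("SV3D.json", encoding="utf-8") as f:
#       workflow = json.load(f)
sample = json.loads("""
{"nodes": [{"id": 8, "type": "VAEDecode", "order": 1},
           {"id": 3, "type": "KSampler", "order": 0}],
 "links": [[7, 3, 0, 8, 0, "LATENT"]]}
""")
types, connections = summarize_workflow(sample)
print(types)
print(connections)
```

Run against the full file, this should show the chain from the loaders through SV3D_Conditioning and KSampler to VAEDecode and VHS_VideoCombine.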

Input one still image
Generate a 360-degree video in which the object rotates
Treat each frame of the video as a separate viewpoint image and reconstruct 3D from them

This approach of applying video generation models to 3D model generation continues to this day.

Today's video generation models are far more capable than those of that time, so they can generate high-definition 360-degree rotation videos without specialized fine-tuning.

Models Aiming Directly for image→3D Model

Up to this point, the premise was a two-stage setup:

First, collect multi-view images (or a rotation video)
Then, build the 3D model with a separate mechanism

From there, models are emerging that go one step further and aim directly for this form:

Input is an image (or text), output is a 3D model right away

Hunyuan3D-2.1

Hunyuan3D-2.1 is a large-scale model for creating 3D assets from images or text.

A first stage that outputs only the shape (a rough 3D geometry)
A second stage that applies a high-resolution appearance, including PBR textures

It is structured in these two stages.

SAM 3D Objects

SAM 3D Objects is a model that reconstructs 3D objects from a single real-world photo.

On the 2D side, use SAM-based segmentation to cleanly cut out the target object
Using the cut-out region as a cue, estimate the 3D shape and texture while filling in the hidden parts

That is the flow it follows.

Although their technical details are completely different, both tackle "image → 3D model" head-on.

World Models

So far, we have discussed modeling a single object. Meanwhile, attempts to create an entire world from photos are also making progress.

"World model" here means a model that constructs a 3D world (scene), not a world model in the sense of predicting physics.

360-degree Panorama Generation

It starts with 360-degree panorama generation.

Tools from Latent Labs and HunyuanWorld-1.0 fall into this category.

Paste the input image onto a panoramic sphere
Fill in the directions that are not shown with outpainting

With this simple idea, they create an image that at least looks filled in across all 360 degrees.
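To illustrate what "pasting onto a panoramic sphere" involves, the sketch below maps a viewing direction to pixel coordinates in an equirectangular panorama and uses it to find which columns a single photo would occupy. The axis convention (+z forward, +x right, +y up), the 90-degree field of view, and the panorama size are all assumptions chosen for the example, not details from any of the tools mentioned.

```python
import math

def direction_to_equirect(x, y, z, width, height):
    """Map a 3D view direction to pixel coordinates in an equirectangular panorama.

    Assumed convention: +z is forward (panorama center), +x is right, +y is up.
    """
    lon = math.atan2(x, z)                    # -pi..pi, 0 = forward
    lat = math.atan2(y, math.hypot(x, z))     # -pi/2..pi/2, 0 = horizon
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v

width, height = 2048, 1024
# Left and right edges of a photo with an assumed 90-degree horizontal FOV.
half = math.tan(math.radians(90.0) / 2.0)
left_u, _ = direction_to_equirect(-half, 0.0, 1.0, width, height)
right_u, _ = direction_to_equirect(half, 0.0, 1.0, width, height)
print(f"columns {left_u:.0f}..{right_u:.0f} come from the photo")
print("everything outside that range has to be outpainted")
```

In this example only a quarter of the panorama's columns are covered by the input, which shows how much of the result depends on outpainting.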

At this stage it cannot be called 3D yet, but by combining it with depth maps and mesh reconstruction, these tools try to build a 3D space with depth.
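The step from panorama plus depth map to 3D can be sketched as lifting every pixel along its viewing ray. The code below assumes an equirectangular panorama, depth values measured as distance along the ray from a camera at the origin, and a +z forward, +y up convention. These are illustrative assumptions, and the function name is hypothetical.

```python
import math

def equirect_pixel_to_point(u, v, depth, width, height):
    """Lift a panorama pixel with a depth value to a 3D point.

    Assumes the camera is at the origin and depth is the distance along the ray.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = depth * math.cos(lat) * math.sin(lon)
    y = depth * math.sin(lat)
    z = depth * math.cos(lat) * math.cos(lon)
    return x, y, z

# Toy 2x4 depth map; each value is sampled at the center of its pixel.
depth_map = [[2.0, 2.0, 2.0, 2.0],
             [3.0, 3.0, 3.0, 3.0]]
rows, cols = len(depth_map), len(depth_map[0])
points = [
    equirect_pixel_to_point(c + 0.5, r + 0.5, depth_map[r][c], cols, rows)
    for r in range(rows)
    for c in range(cols)
]
print(len(points))
```

The resulting point cloud is what a meshing step would then connect into surfaces.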

HunyuanWorld-Mirror

With HunyuanWorld-Mirror, things get closer to creating a world you can essentially walk around in.

It takes an image (or video) as input and estimates camera information, depth, and a 3D representation (3D Gaussians, etc.) all together.