CERTH’s Live Volumetric Studio in TransMIXR

How can we create volumetric video (light stages, other volumetric video setups, sensors, etc.)?

Volumetric video creation in general is a high-end sport. It requires a significant amount of human effort and very expensive dedicated equipment: multiple optical sensors, lighting, processing units, and the infrastructure for connecting and controlling every piece of hardware. Professional volumetric capturing studios come in different formats, such as light stages [guo2019] or green-screen stages [4dviews]. In both cases, lighting is of utmost importance. Light stages rely almost entirely on the lighting setup to produce relightable subjects, while green-screen stages light the subject as uniformly as possible to avoid uncanny effects when the capture is imported into 3D engines. Volumetric capturing systems are highly complex, combining many different technologies: digital cameras, IR sensors and IR projectors, distributed hardware for saving captures, hardware for controlling and synchronising all the deployed sensors, and the lighting infrastructure.

 

How do we create volumetric video (high-level overview of our system)?

In TransMIXR we developed and deployed, from scratch, a cost-efficient volumetric capturing system that relies on consumer-grade depth sensor technologies. CERTH’s volumetric capturing system, called volcap, is distributed and supports many features regarding connectivity, sensor control, and sensor spatio-temporal calibration. At a high level, a main processing unit takes the role of a client that communicates with many sensor servers over a Local Area Network (LAN). The system allows for sensor discovery in the network, RGB-D (color and depth) data capturing and streaming, depth filtering, sensor parameter control, spatial calibration (estimating the position of each physical sensor in the physical 3D space), and temporal calibration (synchronising the different sensor feeds). A high-level overview of the system’s architecture is presented in Fig. 1.

Fig. 1: Volcap’s high-level hardware architecture. A main PC communicates with many sensor processing units via a network switch. The system is scalable and can support an arbitrary number of sensors, limited only by the connection’s bandwidth.
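To make the bandwidth limit concrete, here is a rough back-of-the-envelope sketch. It assumes uncompressed frames with 3 bytes of color and 2 bytes of depth per pixel. The resolution, frame rate, and link speed are illustrative values rather than volcap's actual settings, and the function names are hypothetical.

```python
def stream_mbps(width, height, fps, color_bytes=3, depth_bytes=2):
    """Raw bit rate of one uncompressed RGB-D stream, in megabits per second."""
    bytes_per_frame = width * height * (color_bytes + depth_bytes)
    return bytes_per_frame * fps * 8 / 1_000_000


def max_sensors(link_mbps, per_sensor_mbps):
    """How many such streams fit on a link, ignoring protocol overhead."""
    return int(link_mbps // per_sensor_mbps)


# Illustrative numbers only: 640x480 at 30 fps over a 1 Gbit/s link.
per_sensor = stream_mbps(640, 480, 30)   # 368.64 Mbit/s
print(per_sensor, max_sensors(1000, per_sensor))
```

Under these assumptions a single gigabit link carries only two raw streams, which illustrates why the number of sensors is ultimately bounded by the network rather than by the software.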

Volcap architecture (in detail)

In terms of software, volcap follows a modular, easy-to-extend Entity-Component-System (ECS) architecture. Every sensor is a separate entity in the system, with components attached to it, and the systems process the components of the entities. This allows for easy parallelization of processing among entities, as well as handling and controlling sensor data and state. The main systems in our software deal with device discovery and connectivity, RGB-D data streaming, data filtering, and data saving. In terms of device connectivity, the system follows a plugin-based implementation, which makes it very easy to extend the system to support different sensor technologies. The only effort required to support a device from a different vendor is implementing our plugin interface for that specific device. Finally, other software that relies on volcap’s multi-view RGB-D data, such as real-time reconstruction and spatial sensor calibration, is implemented as a remote service. This allows the flexibility of integrating novel methods in isolation from the rest of the codebase (Figure 2). A screenshot of volcap’s UI is presented in Figure 3.

Fig. 2: High-level volcap software architecture presenting the main systems in our real-time pipeline. The stream-receiving system receives RGB-D data from the network, which is then filtered, presented, and saved.
Fig. 3: Volcap’s UI presenting the main software systems.
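The Entity-Component-System and plugin ideas described above can be illustrated with a minimal sketch. This is not volcap's code: every name here (`SensorPlugin`, `World`, `FakeSensor`, the two systems) is hypothetical and only shows how sensors as entities, data as components, and processing as systems could fit together.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class SensorPlugin(Protocol):
    """Hypothetical plugin interface: one implementation per sensor vendor."""
    def grab_depth(self) -> List[float]: ...


class FakeSensor:
    """Stand-in device returning fixed depth samples (metres, 0.0 = invalid)."""
    def __init__(self, depths):
        self._depths = list(depths)

    def grab_depth(self):
        return list(self._depths)


@dataclass
class Device:                 # component: which plugin drives this sensor
    plugin: SensorPlugin


@dataclass
class DepthFrame:             # component: latest depth samples
    values: List[float] = field(default_factory=list)


class World:
    """Stores components per type, keyed by entity id."""
    def __init__(self):
        self._next_id = 0
        self._components: Dict[type, Dict[int, object]] = {}

    def create_entity(self, *components) -> int:
        eid = self._next_id
        self._next_id += 1
        for c in components:
            self._components.setdefault(type(c), {})[eid] = c
        return eid

    def get(self, eid, ctype):
        return self._components[ctype][eid]

    def query(self, *types):
        """Yield (entity, components...) for entities owning all given types."""
        stores = [self._components.get(t, {}) for t in types]
        for eid in sorted(set.intersection(*(set(s) for s in stores))):
            yield (eid, *(s[eid] for s in stores))


def streaming_system(world):
    """Pull a new frame from every device through its plugin."""
    for _, dev, frame in world.query(Device, DepthFrame):
        frame.values = dev.plugin.grab_depth()


def filtering_system(world, near=0.3, far=4.0):
    """Invalidate samples outside an assumed working range."""
    for _, frame in world.query(DepthFrame):
        frame.values = [d if near <= d <= far else 0.0 for d in frame.values]


world = World()
cam = world.create_entity(Device(FakeSensor([0.1, 1.5, 9.0])), DepthFrame())
streaming_system(world)
filtering_system(world)
print(world.get(cam, DepthFrame).values)
```

In this sketch, supporting a new vendor only requires another class with a `grab_depth` method, and each system touches only the components it asks for, which is what makes per-entity processing easy to parallelize.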

Volcap application features

Our volumetric capture system supports device discovery across multiple devices of different types, ensuring seamless integration and versatility. The system identifies and connects to various capture devices, simplifying setup and letting users choose from a range of hardware options. Spatial calibration further strengthens the system by aligning the spatial positions of the devices to ensure a cohesive capture volume [Sterzentsenko2020]. We have also integrated multiple spatial calibration algorithms, such as checkerboard [zhang2000] and ArUco calibration. Temporal calibration complements this by synchronising the capture devices, so that all sensors record each moment with precise timing, which is crucial for photorealistic 3D human reconstructions.
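As an illustration of what synchronising sensor feeds can involve, the sketch below groups frames from several sensors by nearest timestamp. It is a generic technique shown under stated assumptions (timestamps already share a common clock, each stream is sorted and non-empty), not a description of volcap's temporal calibration. The function names and the tolerance value are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List


def nearest(timestamps: List[int], t: int) -> int:
    """Return the value in the sorted list `timestamps` closest to `t`."""
    i = bisect_left(timestamps, t)
    candidates = timestamps[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda c: abs(c - t))


def group_frames(streams: Dict[str, List[int]], reference: str, tolerance: int):
    """For each reference frame, pick the closest frame of every other sensor.

    A group is kept only if all sensors have a frame within `tolerance`
    (same unit as the timestamps, milliseconds in the example below).
    """
    groups = []
    for t in streams[reference]:
        group = {reference: t}
        for name, stamps in streams.items():
            if name == reference:
                continue
            match = nearest(stamps, t)
            if abs(match - t) > tolerance:
                break                      # this sensor has no frame close enough
            group[name] = match
        else:
            groups.append(group)           # every sensor matched
    return groups


streams = {"cam_a": [0, 33, 66], "cam_b": [2, 36, 90]}
print(group_frames(streams, reference="cam_a", tolerance=5))
```

Here the third frame of `cam_a` is dropped because `cam_b` has no frame within 5 ms of it, so only fully synchronised groups reach the later processing stages.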

 

Beyond capturing, our system offers post-processing and output functionalities. Filtering techniques clean and refine the captured data, removing noise and improving the quality of the volumetric reconstructions. Users can save their projects with ease, ensuring that valuable data is preserved and accessible for future use. Furthermore, our application can save the captured volumetric data to files in real time for later use. Additionally, our live viewer provides real-time visualization of the capture process. For broader applications, the system supports streaming, enabling live broadcasts of volumetric captures to a remote reconstruction and rendering service, which expands the possibilities for integration and streaming to multiple platforms.

 

Reconstruction

To support and integrate multiple reconstruction algorithms, we rely on remote reconstruction: a remote service that receives the real-time RGB-D streams transmitted by volcap and reconstructs them using the selected reconstruction algorithm. So far we have integrated our lab’s method for live 3D human reconstruction [Alexiadis2017], and more algorithms will be added to our pipeline moving forward.

This reconstruction algorithm works in five steps. 1) Given spatially and temporally calibrated multi-view RGB-D streams, we perform back-projection to compute the raw 3D point cloud that has been captured. 2) We voxelize the point cloud with a weighted-averaging splat technique, which filters out noisy 3D points. 3) We transfer the 3D volume to the Fourier (frequency) domain, where we apply an integration filter that fills any gaps in the point cloud. 4) We run the Marching Cubes algorithm to compute triangles and create surfaces from the point cloud. 5) We map textures to the mesh for accurate weighted colorization.
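The first two steps can be sketched in a few lines. The code below is a simplified illustration, not the implementation of [Alexiadis2017]: it assumes a standard pinhole camera model with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`), applies a rigid transform standing in for the result of spatial calibration, and replaces the weighted splat with a plain per-voxel average.

```python
import math


def back_project(depth, fx, fy, cx, cy):
    """Turn a depth image (rows of metres, 0 = no measurement) into 3D points
    in the camera frame using the pinhole model."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue                     # skip invalid pixels
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def to_world(points, rotation, translation):
    """Apply p_world = R * p_cam + t, with R a 3x3 matrix and t a 3-vector."""
    return [
        tuple(sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
              for r in range(3))
        for p in points
    ]


def voxel_average(points, voxel_size):
    """Average all points that fall in the same voxel (uniform weights here,
    a simplification of a weighted splat)."""
    sums = {}
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        s = sums.setdefault(key, [0.0, 0.0, 0.0, 0])
        s[0] += x
        s[1] += y
        s[2] += z
        s[3] += 1
    return {k: (s[0] / s[3], s[1] / s[3], s[2] / s[3]) for k, s in sums.items()}


depth = [[0.0, 2.0],
         [1.0, 0.0]]
cam_points = back_project(depth, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
world_points = to_world(cam_points, identity, (0.0, 0.0, 1.0))
print(voxel_average(world_points, voxel_size=1.0))
```

Points from every sensor would be transformed into the same world frame before voxelization, which is why the spatial calibration has to be accurate for the later steps to produce a clean surface.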

This work was state of the art when published, but it has several limitations, such as the number of humans that can be reconstructed and the size of the space that can be captured. We have started working on a newer human reconstruction algorithm based on 3D Gaussian Splatting [Kerbl2023], and we plan to address the needs of the use cases, such as capturing and reconstructing multiple people and bigger spaces.

Integration for use cases

The TransMIXR project has several use cases, and our Volcap application will be part of two of them: the performing arts use case and the distributed studio use case. Their needs differ: the former is an offline application in Unreal, while the latter runs in real time in Unity. Both use cases will use the same Volcap application and reconstruction algorithm, so we plan on having two different integrations of our capturing and reconstruction system.

For the performing arts use case, we plan on capturing data offline and later reconstructing it to files that will be imported and displayed in Unreal. For this use case, the capturing and reconstruction will be of higher quality than in the distributed studio use case, which requires real-time processing.

For the distributed studio use case, we plan on capturing a human live and transmitting the captured volumetric data in real time to a remote renderer, which will perform both the reconstruction and the rendering. The rendering will be performed from a viewpoint specified by the client, in this case Unity, where the Distributed Studio resides. The 2D rendered image is then sent to Unity to be displayed in place of the reconstructed human. Using the remote renderer is faster and easier to integrate than transmitting and displaying reconstructed 3D data.
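To illustrate what "a viewpoint specified by the client" could look like on the wire, here is a minimal sketch of a request message. The message layout, field names, and JSON encoding are all assumptions made for illustration; the actual protocol between Unity and the remote renderer is not described here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ViewpointRequest:
    """Hypothetical message: camera pose and the size of the image to render."""
    position: Tuple[float, float, float]          # metres, world frame
    rotation: Tuple[float, float, float, float]   # quaternion (x, y, z, w)
    width: int                                    # output image width, pixels
    height: int                                   # output image height, pixels


def encode(request: ViewpointRequest) -> bytes:
    """Serialize the request as UTF-8 JSON."""
    return json.dumps(asdict(request)).encode("utf-8")


def decode(payload: bytes) -> ViewpointRequest:
    """Parse a payload produced by `encode`."""
    data = json.loads(payload.decode("utf-8"))
    return ViewpointRequest(
        position=tuple(data["position"]),
        rotation=tuple(data["rotation"]),
        width=data["width"],
        height=data["height"],
    )


request = ViewpointRequest((0.0, 1.6, -2.0), (0.0, 0.0, 0.0, 1.0), 1280, 720)
assert decode(encode(request)) == request
```

Whatever the concrete format, the request stays a few dozen bytes and the reply is a single 2D image, which is much lighter than sending a full 3D mesh for every frame.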

References:

  1. [guo2019] Guo, K., Lincoln, P., Davidson, P., Busch, J., Yu, X., Whalen, M., … & Izadi, S. (2019). The relightables: Volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics, 38(6), 1-19.
  2. [4dviews] https://www.4dviews.com/
  3. [zhang2000] Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 1330-1334.
  4. [Sterzentsenko2020] Sterzentsenko, V., Doumanoglou, A., Thermos, S., Zioulis, N., Zarpalas, D., & Daras, P. (2020, March). Deep soft procrustes for markerless volumetric sensor alignment. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR) (pp. 818-827). IEEE.
  5. [Alexiadis2017] Alexiadis, D. S., Chatzitofis, A., Zioulis, N., Zoidi, O., Louizis, G., Zarpalas, D., & Daras, P. (2017). An integrated platform for live 3D human reconstruction and motion capturing. IEEE Transactions on Circuits and Systems for Video Technology, 27(4), 798-813.
  6. [Kerbl2023] Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4), Article 139.

Authors: CERTH – Centre for Research and Technology Hellas team
