Hidden 3D Information in Videos


As an example, consider a computer-generated animation of a car race scene featuring a Ford Shelby GT500. The aim is to identify the video scene with temporal data, annotate the region of interest depicting the vehicle as a moving region with spatiotemporal segmentation, and describe the video scene. Assume that the 3D model of the vehicle was created in Autodesk 3ds Max and has to be described by 3D object features, such as geometry, shape, diffuse color, specular color, material, and transparency. Using an X3D plugin, the 3D model can be exported to machine-readable, semistructured XML code that utilizes X3D terms (see Fig. 1).

Exporting a 3D Model to X3D

Figure 1 A 3D model created in Autodesk 3ds Max holds information about the geometry and 3D characteristics that can be described with machine-interpretable X3D annotation
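
To illustrate how such an export can be processed by software, the following is a minimal sketch that reads material attributes and face indices from an X3D document using only the Python standard library. The X3D fragment is hypothetical and is not actual exporter output; its values simply mirror the car paint properties used in this section. The function names material_properties and faces are illustrative assumptions as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical X3D fragment (not actual exporter output); the material values
# mirror the car paint properties used in this section.
X3D_SAMPLE = """
<X3D profile="Interchange" version="3.3">
  <Scene>
    <Shape DEF="carpaint">
      <Appearance>
        <Material diffuseColor="0.110 0.584 0.694"
                  specularColor="0.000 0.000 0.000"
                  transparency="0.000"
                  shininess="0.525"/>
      </Appearance>
      <IndexedFaceSet coordIndex="0 1 2 -1 3 4 5 -1 6 7 8 -1"/>
    </Shape>
  </Scene>
</X3D>
"""

COLOR_FIELDS = ("diffuseColor", "specularColor")
SCALAR_FIELDS = ("transparency", "shininess")


def material_properties(x3d_text):
    """Map each Shape's DEF name to the Material attributes it declares."""
    root = ET.fromstring(x3d_text.strip())
    result = {}
    for shape in root.iter("Shape"):
        material = shape.find("Appearance/Material")
        if material is None:
            continue
        props = {}
        for field in COLOR_FIELDS:
            if field in material.attrib:
                # Colors are three space-separated floats (red, green, blue).
                props[field] = tuple(float(v) for v in material.attrib[field].split())
        for field in SCALAR_FIELDS:
            if field in material.attrib:
                props[field] = float(material.attrib[field])
        result[shape.get("DEF", "unnamed")] = props
    return result


def faces(coord_index):
    """Split a coordIndex string into faces; -1 terminates each face."""
    result, current = [], []
    for token in coord_index.replace(",", " ").split():
        index = int(token)
        if index == -1:
            if current:
                result.append(current)
            current = []
        else:
            current.append(index)
    if current:
        result.append(current)
    return result


if __name__ == "__main__":
    print(material_properties(X3D_SAMPLE))
    print(faces("0 1 2 -1 3 4 5 -1 6 7 8 -1"))
```

The extracted values are the kind of data that can then be restated as ontology-based property assertions.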

The resulting description of the above model can be semantically enriched further using the 3D Modeling Ontology, for example by declaring the application with which the model was created, the base form of the object, and the number of vertices, edges, and faces (see Listing 1).

Listing 1 Fragment of the description logic formalism of the 3D model in Figure 1

3DModel(FORDSHELBYGT500MODEL)
depicts(FORDSHELBYGT500MODEL, FordShelbyGT500)
createdIn(FORDSHELBYGT500MODEL, AutoDesk3dsMax)
baseForm(FORDSHELBYGT500MODEL, Polyhedron)
hasCompound(FORDSHELBYGT500MODEL, Box)
hasCompound(FORDSHELBYGT500MODEL, Cylinder)
hasVertices(FORDSHELBYGT500MODEL, 63281)
hasEdges(FORDSHELBYGT500MODEL, 89448)
hasFaces(FORDSHELBYGT500MODEL, 29816)
coordIndex(mesh1, 0 1 2 -1   3 4 5 -1   6 7 8 -1   9 10 11 -1   12 13 14 -1   15 16 17 -1   18 19 20 -1   21 22 23 -1   24 25 26 -1)
partOf(carpaint, FORDSHELBYGT500MODEL)
diffuseColor(carpaint, 0.110 0.584 0.694)
specularColor(carpaint, 0.000 0.000 0.000)
transparency(carpaint, 0.000)
shininess(carpaint, 0.525)

The geometry of the model can be described precisely with structured annotation by using the terms of the 3D Modeling Ontology in the form of subject–predicate–object (resource–property–value) expressions (RDF triples). First, core geometric data are specified, including the number of vertices, edges, and faces. The geometric base form of the model is declared as a polyhedron, which is modeled as an editable poly in 3ds Max. Two of the geometric primitives of this polyhedron are also declared to further detail the semantics of the model's geometry (see Listing 2).

Listing 2 Turtle serialization of Listing 1

@prefix 3d: <http://vidont.org/3d/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix ex: <http://example.com/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:shelbygt500 a 3d:3DModel ;
    foaf:depicts dbpedia:Shelby_Mustang ;
    3d:createdIn 3d:AutoDesk3dsMax ;
    3d:baseForm 3d:Polyhedron ;
    3d:hasCompound 3d:Box , 3d:Cylinder ;
    3d:hasVertices "63281"^^xsd:nonNegativeInteger ;
    3d:hasEdges "89448"^^xsd:nonNegativeInteger ;
    3d:hasFaces "29816"^^xsd:nonNegativeInteger .

ex:mesh1 3d:coordIndex "0 1 2 -1 3 4 5 -1 6 7 8 -1 9 10 11 -1 12 13 14 -1 15 16 17 -1 18 19 20 -1 21 22 23 -1 24 25 26 -1"^^xsd:complexType .

ex:carpaint 3d:partOf ex:shelbygt500 ;
    3d:diffuseColor "0.110 0.584 0.694"^^xsd:complexType ;
    3d:specularColor "0.000 0.000 0.000"^^xsd:complexType ;
    3d:transparency "0.000"^^xsd:decimal ;
    3d:shininess "0.525"^^xsd:decimal .

Complex models often have thousands of properties; for demonstration purposes, only a few are listed here: two color properties (diffuse color and specular color), transparency, and shininess of the painted parts of the car. The numeric data types are declared using standard XSD datatypes whenever possible; the 3D Modeling Ontology also defines specialized datatypes, mostly by setting additional constraints on standard XSD datatypes.
By using Media Fragments URI 1.0 identifiers, the spatiotemporal segmentation of videos containing 3D models can be done as follows. The positions of the selected shots are specified in Normal Play Time format according to RFC 2326, which is the default time scheme for Media Fragment URIs. The regions of interest are represented by the top-left corner coordinates and the dimensions of their imaginary bounding rectangles, as shown in Fig. 2.

Spatiotemporal Annotation

Figure 2 Spatial annotation of a region of interest using a Media Fragments URI 1.0 identifier
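
The construction of such an identifier can be sketched as follows. This is a minimal illustration under stated assumptions, not a full implementation of the Media Fragments specification: the helper names npt and media_fragment are hypothetical, times are given as whole seconds, and the region is assumed to be expressed in pixels (the default unit of the xywh dimension).

```python
def npt(seconds):
    """Format whole seconds as a Normal Play Time value (h:mm:ss)."""
    if seconds < 0:
        raise ValueError("time must be non-negative")
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def media_fragment(base_uri, start, end, region=None):
    """Append a temporal (t=) and optional spatial (xywh=) fragment to a URI.

    region is an (x, y, width, height) tuple in pixels, where x and y are
    the coordinates of the top-left corner of the bounding rectangle.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    fragment = f"t={npt(start)},{npt(end)}"
    if region is not None:
        x, y, width, height = region
        if min(x, y) < 0 or width <= 0 or height <= 0:
            raise ValueError("invalid region")
        fragment += f"&xywh={x},{y},{width},{height}"
    return f"{base_uri}#{fragment}"


if __name__ == "__main__":
    # Scene from second 5 to second 12, with the region of interest of the car.
    print(media_fragment("http://example.com/carrace.mp4", 5, 12))
    print(media_fragment("http://example.com/carrace.mp4", 5, 12, (164, 51, 454, 827)))
```

Note that the fragment contains no whitespace, because spaces are not allowed in URIs.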

Using a description logic formalism, this video scene can be represented as shown in Listing 3.

Listing 3 Description logic formalism of a video scene

Video(CARRACE)
Scene ⊑ VideoSegment
Scene(OVERTAKING)
sceneFrom(OVERTAKING, CARRACE)
hasStartTime(OVERTAKING, 00:00:05)
duration(OVERTAKING, 00:00:07)
hasFinishTime(OVERTAKING, 00:00:12)
depicts(OVERTAKING, overtaking)
3dsMaxModel(FORDSHELBYGT500MODEL)
partOf(OVERTAKINGROI, OVERTAKING)
MovingRegion(OVERTAKINGROI)
depicts(OVERTAKINGROI, FORDSHELBYGT500MODEL)

This formal description can be written in any RDF serialization, such as RDF/XML, Turtle, Notation3, N-Triples, or N-Quads, or in any compatible lightweight annotation format, such as RDFa, HTML5 Microdata, or JSON-LD. Listing 4 shows the Turtle serialization of the above example.

Listing 4 Spatiotemporal annotation of a video scene in Turtle

@prefix 3d: <http://vidont.org/3d/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix ex: <http://example.com/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mpeg-7: <http://mpeg7.org/> .
@prefix schema: <http://schema.org/> .
@prefix temporal: <http://swrl.stanford.edu/ontologies/built-ins/3.3/temporal.owl#> .
@prefix vidont: <http://vidont.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/carrace.mp4> a schema:VideoObject .

<http://example.com/carrace.mp4#t=0:00:05,0:00:12> a mpeg-7:VideoSegmentTemporalDecompositionType , vidont:Scene ;
    vidont:sceneFrom <http://example.com/carrace.mp4> ;
    temporal:hasStartTime "00:00:05"^^xsd:time ;
    temporal:duration "PT00M07S"^^xsd:duration ;
    temporal:hasFinishTime "00:00:12"^^xsd:time ;
    foaf:depicts dbpedia:overtaking .

ex:FordShelbyGT500Model a 3d:3dsMaxModel .

<http://example.com/carrace.mp4#t=0:00:05,0:00:12&xywh=164,51,454,827> a mpeg-7:VideoSegmentSpatioTemporalDecompositionType , mpeg-7:MovingRegionType ;
    3d:partOf <http://example.com/carrace.mp4#t=0:00:05,0:00:12> ;
    foaf:depicts ex:FordShelbyGT500Model .

The formal definitions of the terms used in the video scene description above are retrieved from MPEG-7, VidOnt, the SWRL Temporal Ontology, DBpedia, FOAF, and Schema.org, by declaring their namespaces and using the corresponding prefixes in the RDF triples as usual. Note that the vocabulary of the MPEG-7 standard was originally written in XSD, and there have been several attempts to map this vocabulary to OWL. Among these, the mapping at mpeg7.org has a stable namespace and complete coverage of MPEG-7 terms, hence it is used here.
In Turtle, the keyword "a" is shorthand for the rdf:type predicate. The above example also exploits the fact that a series of RDF triples sharing the same subject can be abbreviated by stating the subject once, followed by the predicate-object pairs separated by semicolons. Similarly, multiple objects of the same subject and predicate are separated by commas.
Note that, as with other knowledge representations, some of these annotations can be generated automatically based on metadata and/or low-level feature extraction, but the majority of rich semantics still requires human decision and judgment.
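
The two abbreviations can be illustrated with a small sketch that groups triples by subject and writes them in abbreviated form. This is a toy serializer for illustration only, not a complete Turtle writer: the function name abbreviate is hypothetical, and the terms are assumed to be strings already written in valid Turtle syntax (prefixed names or literals).

```python
def abbreviate(triples):
    """Serialize (subject, predicate, object) strings using Turtle shorthand.

    Triples sharing a subject are merged into one statement whose
    predicate-object pairs are separated by semicolons, and rdf:type
    is written as the keyword "a".
    """
    grouped = {}
    for subject, predicate, obj in triples:
        # Dictionaries preserve insertion order, so subjects keep their order.
        grouped.setdefault(subject, []).append((predicate, obj))

    statements = []
    for subject, pairs in grouped.items():
        body = " ;\n    ".join(
            f"{'a' if predicate == 'rdf:type' else predicate} {obj}"
            for predicate, obj in pairs
        )
        statements.append(f"{subject} {body} .")
    return "\n".join(statements)


if __name__ == "__main__":
    triples = [
        ("ex:shelbygt500", "rdf:type", "3d:3DModel"),
        ("ex:shelbygt500", "3d:baseForm", "3d:Polyhedron"),
        ("ex:carpaint", "3d:partOf", "ex:shelbygt500"),
        ("ex:carpaint", "3d:transparency", '"0.000"^^xsd:decimal'),
    ]
    print(abbreviate(triples))
```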
The indexing of 3D properties expressed in structured data enables 3D model retrieval by 3D features. Intelligent services that perform reasoning over the formally represented 3D models can find computer animations that feature a 3D model that is similar or identical to a particular 3D model, represents a real-life object made of the same material, or is transparent to a certain degree.
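
As a rough illustration of retrieval by 3D features, the sketch below filters a small in-memory index by transparency range and by similarity of diffuse color. It is a toy example under stated assumptions: a real system would typically query a triple store rather than a Python list, the index entries other than ex:shelbygt500 are hypothetical, and the function name find_models and the Euclidean color distance are illustrative choices, not part of the 3D Modeling Ontology.

```python
import math

# Hypothetical index; only the ex:shelbygt500 values mirror this section.
MODELS = [
    {"id": "ex:shelbygt500", "diffuseColor": (0.110, 0.584, 0.694), "transparency": 0.0},
    {"id": "ex:windshield", "diffuseColor": (0.800, 0.900, 1.000), "transparency": 0.7},
    {"id": "ex:tire", "diffuseColor": (0.050, 0.050, 0.050), "transparency": 0.0},
]


def find_models(models, min_transparency=0.0, max_transparency=1.0,
                near_color=None, tolerance=0.1):
    """Return the ids of models matching the given 3D feature constraints.

    near_color is an optional (r, g, b) tuple; a model matches if the
    Euclidean distance between its diffuse color and near_color does
    not exceed tolerance.
    """
    matches = []
    for model in models:
        if not min_transparency <= model["transparency"] <= max_transparency:
            continue
        if near_color is not None:
            if math.dist(model["diffuseColor"], near_color) > tolerance:
                continue
        matches.append(model["id"])
    return matches


if __name__ == "__main__":
    # Models that are at least 50% transparent.
    print(find_models(MODELS, min_transparency=0.5))
    # Opaque models with a diffuse color close to the car paint.
    print(find_models(MODELS, max_transparency=0.0, near_color=(0.1, 0.6, 0.7)))
```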