comet/index.html at master · energy-based-model/comet · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="description"
          content="Unsupervised Learning of Compositional Energies">
    <meta name="author"
          content="Yilun Du, Shuang Li, Yash Sharma, Joshua B. Tenenbaum, Igor Mordatch">

    <title>Unsupervised Learning of Compositional Energy Concepts</title>
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css"
          integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">

    <!-- Custom styles for this template -->
    <link href="offcanvas.css" rel="stylesheet">
    <!--    <link rel="icon" href="img/favicon.gif" type="image/gif">-->
</head>

<body>
<div class="jumbotron jumbotron-fluid">
    <div class="container"></div>
    <h2>Unsupervised Learning of Compositional Energy Concepts</h2>
    <h3>NeurIPS 2021</h3>
    <hr>
    <p class="authors">
        <a href="https://yilundu.github.io">Yilun Du</a>,
        <a href="https://shuangli59.github.io/"> Shuang Li</a>,
        <a href="https://www.yash-sharma.com/"> Yash Sharma</a>,
        <a href="https://mitibmwatsonailab.mit.edu/people/joshua-tenenbaum/"> Joshua B. Tenenbaum</a>,
        <a href="">Igor Mordatch</a>
    </p>
    <div class="btn-group" role="group" aria-label="Top menu">
        <a class="btn btn-primary" href="https://arxiv.org/abs/2111.03042">Paper</a>
        <a class="btn btn-primary" href="https://github.com/yilundu/comet">Code</a>
    </div>
</div>


<div class="container">
    <div class="section">
        <div class="row align-items-center">
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="fig/teaser.mp4" type="video/mp4">
                </video>
            </div>
        </div>
        <br>

        <p>
            Humans are able to rapidly understand scenes by utilizing concepts extracted from prior experience. Such concepts
            are diverse, and include global scene descriptors, such as the weather or lighting, as well as local scene descriptors,
            such as the color or size of a particular object. So far, unsupervised discovery of concepts has focused on either
            modeling the global scene-level or the local object-level factors of variation, but not both. In this work, we propose COMET,
            which discovers and represents concepts as separate energy functions, enabling us to represent both global concepts as well as
            objects under a unified framework. COMET discovers energy functions through recomposing the input image, which we find captures
            independent factors without additional supervision. Sample generation in COMET is formulated as an optimization process on underlying
            energy functions, enabling us to generate images with permuted and composed concepts.  Finally, discovered visual concepts in COMET
            generalize well, enabling us to compose concepts between separate modalities of images as well as with other concepts discovered by a separate
            instance of COMET trained on a different dataset.
        </p>
    </div>

    <div class="section">
        <h2>Video</h2>
        <hr>
        <div class="vcontainer">
            <iframe class='video' src="https://www.youtube.com/embed/bsXaLiDr8Xg" frameborder="0"
                    allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture"
                    allowfullscreen></iframe>
        </div>
    </div>

    <div class="section">
        <h2>The trouble with &beta;-VAE, MONET & Co.</h2>
        <hr>
        <p>
            Unsupervised discovery of visual concepts has achieved great success, and has seen applications in reinforcement learning and planning.
            Despite this, existing unsupervised decomposition methods are, to date, limited in their capabilities. One underlying weakness is that
            they are often tailored to the particular type of concept under consideration, with approaches such as &beta;-VAE excelling at representing
            global factors of variation, and other approaches such as MONET excelling at representing object factors of variation. Furthermore,
            inferred concepts in either case do not generalize well. While humans are able to combine concepts learned from one experience with those from another
            - there does not exist an way to combine a concept inferred from one dataset with another concept inferred from a separate dataset.
        </p>

    </div>

    <div class="section">
        <h2>Composable Energy Network</h2>
        <hr>
        <p>
            We introduce Composable Energy Network, or COMET, which overcomes these difficulties. COMET discovers and represents visual concepts
            as a set of <b>composable energy functions</b>. Each individual inferred energy function defines a separate <b>minimal energy state</b> which
            represents either a global or local factor of variation. Multiple energy functions are then composed together through summation to define a new minimal energy
            state consisting of each specified factor of variation. Individual composed energy functions <b>generalize well</b> and can be composed with
            energy functions inferred by other COMET models on <b>disjoint</b> datasets. First, we illustrate the ability of COMET to capture <b>global factors of variation</b>.
            Below, we illustrate the gradients of each energy function with respect to an image, indicating aspects of an image an energy function pays attention to.
            We further label each energy function with a posited global factor through visual inspection.
        </p>

        <div class="row align-items-center">
            <div class="col justify-content-center text-center">
                <img src='fig/posited_factors_global.png' class='img-fluid'>
            </div>
        </div>

        <p>
        We may further recombine individual energy functions discovered in both of the above domains together.
        </p>

        <div class="row align-items-center">
            <div class="col justify-content-center text-center">
                <img src='fig/recombine.png' class='img-fluid'>
            </div>
        </div>
    </div>

    <div class="section">
        <h2>Object Level Decomposition </h2>
        <hr>
        <p>
            Decomposed energy functions may also represent local object factors of variation.
            Below, we illustrate how COMET enables us to decompose and recombine energy functions representing both local and global factors of
            variation <b>simultaneously</b>, as well as compose multiple instances of objects together at once.
        </p>
        <div class="row align-items-center">
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="fig/object.mp4" type="video/mp4">
                </video>
            </div>
        </div>
    </div>


    <div class="section">
        <h2>Compositional Generalization</h2>
        <hr>
        <p>
        We find that inferred energy functions from COMET can further <b>generalize</b> well. In the main paper, we illustrate how we may compose energy functions
        across separate modes of data. Below, we illustrate how we can combine energy functions from one COMET model trained on one dataset with energy functions
        from another COMET model trained on a separate dataset.
        </p>

        <div class="row align-items-center">
            <div class="col justify-content-center text-center">
                <video width="80%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="fig/generalize.mp4" type="video/mp4">
                </video>
            </div>
        </div>

    </div>

        <div class="section">
            <h2>Related Projects</h2>
            <hr>
            <p>
                Check out our related projects on compositional energy functions! <br>
            </p>

            <div class='row vspace-top'>
                <div class="col-sm-3">
                    <img src='img/comp_cartoon.png' class='img-fluid'>
                </div>

                <div class="col">
                    <div class='paper-title'>
                        <a href="https://energy-based-model.github.io/compositional-generation-inference/">Compositional Visual Generation with Energy Based Models</a>
                    </div>
                    <div>
                        We show how EBMs enable <b>zero-shot compositional</b> visual generation, enabling us to compose visual concepts
                        (through operators of conjunction, disjunction, or negation) together in a zero-shot manner.
                        Our approach enables us to generate faces given a  description
                        ((Smiling AND Female) OR (NOT Smiling AND Male)) or to combine several different objects together.
                    </div>
                </div>
            </div>

            <div class='row vspace-top'>
                <div class="col-sm-3">
                    <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                        <source src="img/uncond_gen_half.mp4" type="video/mp4">
                    </video>
                </div>

                <div class="col">
                    <div class='paper-title'>
                        <a href="https://arxiv.org/abs/2012.01316">Improved Contrastive Divergence Training of Energy Based Models</a>
                    </div>
                    <div>
                        We show that the traditional contrastive divergence training objective used to train EBMs
                        is omits a important gradient term. We propose a loss to represent this missing gradient
                        and propose additional tricks to improve EBM training. We show that our resultant models
                        are able to generate high resolutions images and are further able to compose with each
                        other.
                    </div>
                </div>
            </div>

            <div class='row vspace-top'>
                <div class="col-sm-3">
                    <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                        <source src="img/half.mp4" type="video/mp4">
                    </video>
                </div>

                <div class="col">
                    <div class='paper-title'>
                        <a href="https://openai.com/blog/energy-based-models/">Implicit Generation and Generalization with Energy Based Models</a>

                    </div>
                    <div>
                        We introduce a method to scale EBM training to modern neural network architectures.
                        We show that such trained EBMs have a set of unique properties, enabling model robustness,
                        image and trajectory modeling, continual learning and compositional visual generation.
                    </div>
                </div>
            </div>


            <div class="section">
                <h2>Paper</h2>
                <hr>
                <div>
                    <div class="list-group">
                        <a href=""
                           class="list-group-item">
                            <img src="img/paper_thumbnail.png"
                                 style="width:100%; margin-right:-20px; margin-top:-10px;">
                        </a>
                    </div>
                </div>
            </div>

            <div class="section">
                <h2>Bibtex</h2>
                <hr>
                <div class="bibtexsection">
                    @inproceedings{du2021comet,
                      title={Unsupervised Learning of Compositional Energy Concepts},
                      author={Du, Yilun and Li, Shuang and Sharma, Yash and Tenenbaum, B. Joshua
                      and Mordatch, Igor},
                      booktitle={Advances in Neural Information Processing Systems},
                      year={2021}
                    }
                </div>
            </div>

            <hr>

            <footer>
                <p>Send feedback and questions to <a href="https://yilundu.github.io">Yilun Du</a></p>
            </footer>
        </div>

</body>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
</html>