With the rise of deep learning, computer vision systems have been highly successful at understanding images. However, understanding the dynamic visual world we live in requires understanding both the appearance of individual image frames and the temporal relationships between them. This thesis aims to understand videos through the lens of time, by learning from the temporal relationships within image sequences, both instantaneously and over extended periods. In the first half, we focus on using instantaneous motion, the temporal changes between neighbouring video frames, to discover moving objects, based on the intuition that the subject of a video usually moves independently of the background. We propose two methods for this task: the first discovers a single object in a self-supervised manner by grouping motion into layers; the second discovers multiple objects over time in a supervised manner using a vision foundation model. We show applications to general videos, as well as to discovering objects with minimal visibility, such as camouflaged objects, for which we also present the largest video camouflage dataset to date. In the second half, we go beyond instantaneous changes and learn from patterns of change over time, from seconds (natural videos) to days (time-lapse videos) to years (longitudinal images). We leverage the properties of time as a direct supervisory signal, and introduce applications that were previously unachievable in computer vision. We first exploit “uniformity”, the fact that time flows at a constant rate, to read analog clocks in unconstrained scenes. We then relax this constraint to “monotonicity”, the assumption that certain changes are consistently unidirectional over a period of time, to discover monotonic changes in a sequence of images. For both cases, we also contribute datasets to foster further research.