A fast variational on-line learning technique is described for training a transformed 
hidden Markov model. A simplified general model and an associated 
estimation algorithm are provided for modeling visual data such as a 
video sequence. Specifically, once the model has been initialized, an expectation-maximization (“EM”) 
algorithm is used to learn the one or more 
object class models, so that the 
video sequence has high marginal probability under the model. In the expectation step (the “E-Step”), the 
model parameters are assumed to be correct, and for an input image, 
probabilistic inference is used to fill in the values of the unobserved or hidden variables, e.g., the 
object class and appearance. In one embodiment of the invention, a 
Viterbi algorithm and a 
latent image are employed for this purpose. In the maximization step (the “M-Step”), the 
model parameters are adjusted using the values of the unobserved variables calculated in the previous E-Step. Instead of the 
batch processing typically used in EM, 
the 
system and method according to the invention employ an on-line 
algorithm that passes through the data only once and introduces new classes as new data is observed. Through parameter 
estimation and 
inference in the model, visual data is segmented into components, which facilitates sophisticated applications in video or 
image editing, such as, for example, object removal or 
insertion, tracking and 
visual surveillance, 
video browsing, photo organization, video 
compositing, and metadata creation.
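
The single-pass E-Step/M-Step cycle described above, including the introduction of new classes as data arrives, can be illustrated with a deliberately simplified sketch. It is not the transformed hidden Markov model itself: it uses a one-dimensional Gaussian mixture with a fixed shared variance, and it omits transformations, temporal transitions, the Viterbi algorithm, and the latent image. The class name `OnlineMixture`, the parameter `new_class_threshold`, and the distance-based rule for creating a class are all assumptions made for illustration.

```python
import math


class OnlineMixture:
    """Toy single-pass EM for a 1-D Gaussian mixture (illustrative assumption)."""

    def __init__(self, var=1.0, new_class_threshold=3.0):
        self.var = var                        # fixed, shared variance (assumed)
        self.threshold = new_class_threshold  # in standard deviations (hypothetical)
        self.means = []                       # one mean per class
        self.counts = []                      # accumulated responsibility per class

    def e_step(self, x):
        # Treat the current parameters as correct and infer the hidden class:
        # posterior responsibility of each class for observation x.
        weights = [
            c * math.exp(-0.5 * (x - m) ** 2 / self.var)
            for m, c in zip(self.means, self.counts)
        ]
        total = sum(weights)
        return [w / total for w in weights]

    def observe(self, x):
        """Process one observation exactly once; return its most likely class."""
        sd = math.sqrt(self.var)
        # Introduce a new class when no existing class explains x well.
        if not self.means or min(abs(x - m) for m in self.means) > self.threshold * sd:
            self.means.append(x)
            self.counts.append(1.0)
            return len(self.means) - 1

        resp = self.e_step(x)
        # M-Step: incremental update of each class from its responsibility,
        # so earlier observations never need to be revisited.
        for k, r in enumerate(resp):
            self.counts[k] += r
            self.means[k] += (r / self.counts[k]) * (x - self.means[k])
        return max(range(len(resp)), key=resp.__getitem__)


if __name__ == "__main__":
    model = OnlineMixture()
    labels = [model.observe(x) for x in [0.1, -0.2, 10.0, 0.0, 9.8, 10.3]]
    print(labels, model.means)
```

In this sketch each observation is folded into running statistics and then discarded, which is the property that distinguishes an on-line update from batch EM, where every iteration revisits the full data set.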