
Motion History Images (MHIs) for Activity Classification in Videos

  • The Codess
  • May 5
  • 9 min read

The following is a final project report I wrote for my Computer Vision class. The project used something called Motion History Images (MHIs) to identify actions in videos. My implementation didn't work well in all instances, but I found it interesting. Basically, frames selected at different times are subtracted from each other, and the differences are used to describe perceived motion based on how much the pixels have been displaced. The differences are represented by white pixels that slowly fade to gray as the motion becomes older, and they stand out clearly against a black background. Examples of these images can be found below. The result is a set of ghostly images carrying many traces of movement from previous frames. In my method, I also weighted certain time frames more heavily and tracked the speed of the motion to help determine the action. Speed is estimated using optical flow, which computes a motion vector for each pixel (basically what direction and how far pixels have moved). This helped me tell apart similar motions like jogging and running.


MHIs are used to recognize activities, which require sequential information to determine what is taking place. It's a really neat way of capturing motion without the need for complex physics equations. This approach has even been used in tandem with other techniques for pedestrian tracking in self-driving cars. Pretty neat! The following is a more technical breakdown.


  1. INTRODUCTION


    Human activity recognition (HAR) has been an interesting field of computer vision because of its wide range of applications. Researchers have investigated the role of HAR in healthcare, sports, crowd surveillance, smart homes, and robotics (Kulsoom et al., 2022). Vision-based HAR presents multiple challenges: obtaining smooth motion, variations in human shape and size, different lighting conditions, noisy backgrounds, and moving camera viewpoints can all cause misclassification. Additionally, activities differ from actions in that they are sequential and require multiple frames to determine what is taking place (Kulsoom et al., 2022).

    One common approach to activity classification is the use of motion history images (MHIs). An MHI encodes a sequence of motion as a grayscale silhouette image in which more recent motion appears with brighter pixel intensity. Everything that has not changed between the current and previous frames is set to zero and therefore ignored. MHIs are generated from motion energy images (MEIs), binary images created by thresholding the difference between the current frame and a previous timestep (Ahad et al., 2010). MHIs are widely used in activity classification because they capture motion without complex trajectory computations.

    This paper utilizes MHIs to classify six human activities: boxing, handclapping, handwaving, jogging, running, and walking. The MHIs are used to generate the seven standard Hu moments, which provide numerical descriptors of the shape of the silhouette in the MHI (Gopal, 2024). Both scale-variant and scale-invariant moments are utilized to provide a fuller picture of the motion; when images and videos of different resolutions were reconstructed from scale-invariant representations alone, the replications were inexact (Van Noord and Postma, 2017). Time-weighted MHIs are also implemented to place more emphasis on the distinguishing portions of the action: the MHI is multiplied by a temporal importance function that follows a parabolic curve instead of increasing linearly (Komori et al., 2023).
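    To make the time-weighting idea concrete, here is a minimal sketch of what multiplying an MHI by a parabolic temporal importance function could look like. The exact curve, its peak position, and the function name are illustrative assumptions, not the project's actual implementation:

```python
import numpy as np

def time_weighted_mhi(mhi, tau=20.0, peak=0.7):
    """Re-weight an MHI so emphasis peaks at an intermediate moment in the action.

    MHI pixel values encode recency (tau = newest motion, 0 = none/oldest).
    A plain MHI weights the newest motion most; here a downward parabola
    centered at `peak` (a fraction of tau) emphasizes mid-action frames instead.
    """
    recency = mhi / tau                                   # normalize to [0, 1]
    half_width = max(peak, 1.0 - peak)                    # weight reaches 0 at the far edge
    weight = 1.0 - ((recency - peak) / half_width) ** 2   # parabolic importance curve
    return mhi * np.clip(weight, 0.0, 1.0)
```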



  2. RELATED WORKS

    Activity classification represents a large field of problems in computer vision. Zernetsch et al. (2018) used MHIs in combination with a ResNet to identify cyclists at crosswalks. The authors used existing databases to detect three classes within a region of interest: person, bicycle, and motorbike. This trained model segments the cyclist from the background, and MHIs are then used to determine whether the cyclist is in a "waiting" or "moving" state. The authors were able to correctly identify the cyclist moving and correctly disregard motion of the person on the bicycle that did not lead to forward motion. There were a few false positives when a pedestrian passed by the cyclist, causing their silhouettes to merge.

    In the application of sports, Komori et al. (2023) implement a time-weighted MHI (TW-MHI) to capture nuances in motion. The authors introduce a gamma distribution as a temporal importance function. This redefines the most important frame: rather than the latest frame, it is a frame determined by tuning the hyperparameters 𝛼 and 𝜃. These parameters were originally set by human experts who could identify keyframes prior to training, but the authors also provide a method of determining the temporal importance function automatically: an initial function is generated for every d-th frame, the classifier is trained on each one, and the resulting functions are plotted against their accuracy.

    Mohsen et al. (2021) look at HAR in the context of smart factories, where it is necessary to understand the efficiency of human workers during the manufacturing process. The authors seek to identify six activities: Laying, Downstairs, Sitting, Upstairs, Standing, and Walking. They employ K-Nearest Neighbors (KNN) to classify these actions, using acceleration and gyroscope readings along the x, y, and z axes as features. The authors achieved 91.46% accuracy when k was increased to 20.



  3. METHOD

    The model used in this paper combines techniques from previous works to achieve high testing accuracy and promote robustness. The method is split into four stages: preprocessing, MHI generation, Hu moment calculation, and training.


    3.1. Preprocessing

    177 input videos from the Recognition of Human Actions dataset were divided into 200 frames each and converted to grayscale. A Gaussian blur with a kernel size of 7x7 is then applied to each frame. Before calculating the MEIs, the median of the frame is subtracted from the frame values to further segment the person from the background regardless of lighting; this works well here because the background is plain.
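    A minimal sketch of this preprocessing step, assuming OpenCV and NumPy (the function name and the exact handling of the median subtraction are my own; the kernel size follows the text):

```python
import cv2
import numpy as np

def preprocess_frame(frame):
    """Grayscale, blur, and median-subtract a single video frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # 7x7 Gaussian blur suppresses noise before frame differencing.
    blurred = cv2.GaussianBlur(gray, (7, 7), 0)
    # Subtracting the frame's median pushes the plain background toward zero,
    # which helps separate the person regardless of lighting.
    med = np.median(blurred)
    return np.clip(blurred.astype(np.float32) - med, 0, 255).astype(np.uint8)
```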


    3.2. MHI Generation

    The MEIs are generated from the following formula:

    $$D_t(x, y) = \begin{cases} 1 & \text{if } \lvert I_t(x, y) - I_{t-1}(x, y) \rvert \geq \theta \\ 0 & \text{otherwise} \end{cases}$$

    where I_t is the current frame, I_{t-1} is the previous frame, and 𝜃 is a threshold value. 𝜃 determines how different a pixel value has to be from the previous frame to be kept in the binary image. If 𝜃 is too small, there will be too many bright pixels, causing blobs instead of clear silhouettes. If 𝜃 is too high, the motion cannot be effectively captured. For these experiments, a 𝜃 value of 20 is used for all activities.

    The MEI is then converted to an MHI using the formula:

    $$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D_t(x, y) = 1 \\ \max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$

    where 𝝉 determines how slowly previous motion decays over time. For these experiments, 𝝉 is set to 20.
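    Putting the two formulas together, a rough NumPy sketch of one MEI/MHI update could look like this (the array handling and function name are assumptions; 𝜃 and 𝝉 follow the values above):

```python
import numpy as np

THETA = 20  # MEI threshold
TAU = 20    # MHI decay duration

def update_mhi(mhi, prev_frame, curr_frame):
    """Fold one preprocessed grayscale frame pair into the motion history image."""
    # MEI: 1 where the pixel changed by at least THETA since the previous frame.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    mei = diff >= THETA
    # MHI: changed pixels reset to TAU, unchanged pixels decay by 1 toward zero.
    return np.where(mei, TAU, np.maximum(mhi - 1, 0)).astype(np.float32)

# Usage: start from zeros and update frame by frame.
# mhi = np.zeros(frame_shape, dtype=np.float32)
# mhi = update_mhi(mhi, prev, curr)
```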















Figure 1: (From top left to bottom right) MHIs of boxing, clapping, waving, jogging, running, and walking




    3.3. Hu Moment Calculation

The image moments of the MHI are defined as:

$$M_{pq} = \sum_{x}\sum_{y} x^{p}\, y^{q}\, I(x, y)$$

These are combined with translation- and scale-invariant moments, the central moments and their scale-normalized versions:

$$\mu_{pq} = \sum_{x}\sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q}\, I(x, y), \qquad \eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,1 + (p+q)/2}}$$

where $\bar{x} = M_{10}/M_{00}$ and $\bar{y} = M_{01}/M_{00}$. Both are converted into the seven Hu moments given by:

$$\begin{aligned}
h_1 &= \eta_{20} + \eta_{02} \\
h_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
h_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
h_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
h_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\bigl[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\bigr] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\bigl[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\bigr] \\
h_6 &= (\eta_{20} - \eta_{02})\bigl[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\bigr] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
h_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\bigl[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\bigr] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\bigl[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\bigr]
\end{aligned}$$
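In practice these quantities can be computed directly from the MHI; below is a sketch using OpenCV's cv2.moments and cv2.HuMoments. Treating the second- and third-order central moments as the seven scale-variant features is my reading of the description, not a quote of the project code:

```python
import cv2
import numpy as np

def mhi_shape_features(mhi):
    """Return 7 scale-variant central moments plus the 7 Hu invariants of an MHI."""
    m = cv2.moments(mhi.astype(np.float32))
    # Scale-variant half of the feature vector: central moments mu_pq.
    central = np.array([m["mu20"], m["mu11"], m["mu02"],
                        m["mu30"], m["mu21"], m["mu12"], m["mu03"]])
    # Scale-invariant half: the seven Hu moments built from the normalized moments.
    hu = cv2.HuMoments(m).flatten()
    return np.concatenate([central, hu])
```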




    3.4. KNN Classifier

In addition to these 14 features, speed and direction were added to help differentiate between similar movements, such as jogging and running. These features were determined using optical flow, which estimates a motion vector for each pixel between the previous frame and the current frame. The moments were also scaled logarithmically to make them easier to compare. This produces 16 total features, which are passed into a K-Nearest Neighbors (KNN) classifier. KNN stores the labeled training points and, during testing, assigns a new data point the label held by the majority of its k nearest neighbors in feature space. The classifier had seven output classes: one for each activity, plus an extra label titled 'NO LABEL' used to signify frames in which no activity took place. Empty frames were commonly found in the walking, jogging, and running videos.
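A hedged sketch of how the speed/direction features and the classifier could fit together follows; Farnebäck dense optical flow and scikit-learn's KNeighborsClassifier are my choices to illustrate the description, and all parameter values are placeholders:

```python
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def speed_direction(prev_gray, curr_gray):
    """Mean magnitude and angle of dense optical flow between two grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return np.array([mag.mean(), ang.mean()])

def log_scale(moments):
    """Log-scale moment features so their widely varying magnitudes are comparable."""
    return -np.sign(moments) * np.log10(np.abs(moments) + 1e-30)

# X: one row per frame = 14 log-scaled moments + [speed, direction]
# y: activity labels, including the extra 'NO LABEL' class for empty frames.
# knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# y_pred = knn.predict(X_test)
```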



  4. EXPERIMENTS

Hu moments, speed, and direction features were extracted from 200 equally spaced frames of each of 177 randomly selected videos. The processed frames were randomly split into training and test sets consisting of 80 percent and 20 percent of the frames, respectively. The model was then tested on the entirety of two sets of videos for each class: one set with normal lighting conditions and a second set with bright lighting, some of which included zooming. The predicted labels were updated every ten selected frames to smooth the labeling over the videos and prevent extreme label flickering due to model uncertainty.
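One simple way to realize the per-ten-frame label smoothing described above is a majority vote over each window of predictions; this is a sketch under my own assumption about how the smoothing was applied:

```python
from collections import Counter

def smooth_labels(frame_predictions, window=10):
    """Replace each block of per-frame predictions with its majority label."""
    smoothed = []
    for start in range(0, len(frame_predictions), window):
        block = frame_predictions[start:start + window]
        majority = Counter(block).most_common(1)[0][0]
        smoothed.extend([majority] * len(block))
    return smoothed
```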


  5. DISCUSSION

The average accuracy on the random test frames was 81.33 percent. In the confusion matrix, every class has its highest predictions on the correct label. However, the model becomes more confused between walking, jogging, and running, which is also reflected in the training ROC curves. This is expected, since these activities are closely related in shape, speed, and direction. They also have the least representation in the dataset: although approximately the same number of examples is used for each activity, the person walking, jogging, or running spends most of the video out of frame, leading to more "NO LABEL" examples. Adding "NO LABEL" is nevertheless crucial because it prevents the model from predicting an activity for an empty screen. To counteract the imbalance, the classes were resampled to give each class equal representation. This increased accuracy significantly but does not generalize well, because it resamples the same frames multiple times. In future iterations, it would be beneficial to use more examples from classes where the person is not always in frame.

When testing on full video feeds, the accuracy falls significantly to 31.79 percent. The model appears to stabilize toward the end of a video, once enough history has accumulated to make an accurate prediction. The most accurate predictions often occur in the final frames, which is also where most of the training still frames were collected. This may be remedied by increasing 𝝉 to allow more information to accumulate in the MHIs. Handclapping in particular shows significant issues, often being confused with boxing and waving. Handwaving had many issues when combined with fast zooming, being confused with running and jogging, probably due to incorrect speed estimates. Additionally, the model experiences a lot of label flickering when the person slows down while exiting the frame in the running, walking, and jogging videos; the change in speed causes the model to switch between these labels quickly. In the future, more label smoothing should be applied to prevent rapid changes in prediction, and more data at the height of each action is needed for better generalization. An MLP or CNN might be more suitable for a generalizable model, but it would require much more training time and hyperparameter tuning.

Despite these issues, there are at least a few frames in each video with the correct label. Adding speed and direction helped differentiate between jogging, walking, and running, but these features may have caused problems on videos with camera zoom and camera shake, which make the action appear faster than it actually is. There was also a large jump in accuracy when using a combination of scale-variant and scale-invariant moments, which made the model more consistent as the silhouettes both changed and maintained their shapes; for example, a runner with an extended stride versus a closed stride can be continuously classified as running using all fourteen moments. The model performed about the same in different lighting conditions without camera zooming, since the median subtraction during preprocessing segments the person from the background.







Figure 2: KNN training confusion matrix

Figure 3: KNN training ROC curves





              precision    recall  f1-score   support

           0       0.99      0.96      0.98      9929
           1       0.92      0.96      0.94     10002
           2       0.93      0.99      0.96      9951
           3       0.94      0.97      0.96      9974
           4       0.79      0.76      0.78     10031
           5       0.82      0.85      0.84     10041
           6       0.82      0.74      0.78     10044

    accuracy                           0.89     69972
   macro avg       0.89      0.89      0.89     69972
weighted avg       0.89      0.89      0.89     69972

Table 1: Training statistical report





















Figure 4: Labeled frames extracted from videos








REFERENCES

Ahad, M. A. R., Tan, J., Kim, H., & Ishikawa, S. (2010). Motion history image: Its variants and applications. Machine Vision and Applications, 23, 255-281. https://doi.org/10.1007/s00138-010-0298-4


Gopal, S. (2024). Multi class activity classification in videos using Motion History Image generation. arXiv preprint arXiv:2410.09902.


Komori, H., Isogawa, M., Mikami, D., et al. (2023). Time-weighted motion history image for human activity classification in sports. Sports Engineering, 26, 45. https://doi.org/10.1007/s12283-023-00437-1


Kulsoom, F., Narejo, S., Mehmood, Z., Chaudhry, H., Butt, A., & Bashir, A. (2022). A review of machine learning-based human activity recognition for diverse applications. Neural Computing and Applications, 34. https://doi.org/10.1007/s00521-022-07665-9


Mohsen, S., Elkaseer, A., & Scholz, S. G. (2021, September). Human activity recognition using k-nearest neighbor machine learning algorithm. In Proceedings of the International Conference on Sustainable Design and Manufacturing (pp. 304-313). Singapore: Springer Singapore.


Van Noord, N., & Postma, E. (2017). Learning scale-variant and scale-invariant features for deep image classification. Pattern Recognition, 61, 583-592. 


Zernetsch, S., Kress, V., Sick, B., & Doll, K. (2018, June). Early start intention detection of cyclists using motion history images and a deep residual network. In 2018 IEEE Intelligent Vehicles Symposium (IV) (pp. 1-6). IEEE.
