ICMR 2016


Tutorial 1: Situation Recognition from Multimodal Data
Tutorial 2: On the “Face of Things”

TUTORIAL 1 : Situation Recognition from Multimodal Data



Situation recognition is the problem of deriving actionable insights from heterogeneous, real-time, big multimedia data to benefit human lives and resources in different applications. This tutorial will discuss the recent developments towards converting multitudes of data streams including weather patterns, stock prices, social media, traffic information, and disease incidents into actionable insights.

For multiple decades, multimedia researchers have been building approaches like entity resolution, object detection, and scene recognition to understand different aspects of the observed world. Unlike in the past, though, we no longer need to undertake sensemaking based on data coming from a single media element, modality, time frame, or location of media capture. Real-world phenomena are now being observed by multiple media streams, each complementing the others in terms of data characteristics, observed features, perspectives, and vantage points. Each of these multimedia streams can now be assumed to be available in real time, and an increasingly large portion of them comes inscribed with space and time semantics. The number of such media elements available (e.g. Tweets, Flickr posts, sensor updates) is already in the order of trillions, and the computing resources required for analyzing them are becoming increasingly available. We expect these trends to continue, and we expect one of the biggest challenges in multimedia computing in the near term to be that of concept recognition from such multimodal data. As shown in Figure 1, the challenges in situation recognition are fundamentally different from those in object or event recognition. They involve dealing with multiple media, each capturing real-world phenomena from multiple vantage locations spread over time.

Detecting situations in time to take appropriate actions for saving lives and resources can transform multiple aspects of human life, including health, natural disasters, traffic, the economy, social reforms, business decisions, and so on. Examples of such relevant situations include beautiful-days/hurricanes/wildfires, traffic (jams/smooth/normal), economic recessions/booms, blockbusters, droughts/great-monsoons, seasons (early-fall/fall/late-fall), demonstrations/celebrations, social uprisings/happiness-index, flash-mobs, flocking, and so on.

This tutorial will provide the audience with a thorough theoretical and practical grounding in the field of situation recognition. It will bring together the work of multiple scholars working in the area of situation recognition, both within and outside the multimedia research community. The attendees will be introduced to the different interpretations of situations across multiple fields, and to how situation recognition builds upon and extends the efforts on object detection, event detection, scene recognition, and so on. The tutorial will provide a review of recent efforts within the multimedia community towards detecting real-time situations, and the attendees will be introduced to multiple practical situation recognition approaches and applications. Specific attention will be paid to discussing the open research challenges that the community must address to advance the state of the art in situation recognition.

The multimedia research community is particularly well positioned to take on the challenge of situation recognition. First, the community’s core competence lies in handling heterogeneous media (audio, video, text, phone logs, micro-blogs, sensors, etc.). Equally importantly, it is the only research community which has studied concept detection across both time (event detection) and space (such as the spatial organization of pixels in images). The tools for raster image processing translate directly onto spatial data layouts, and notions of neighborhood, regions, boundaries, and motion vectors translate seamlessly to spatio-temporal analysis. Recognizing situations in the right place at the right moment to take appropriate actions can save lives and resources in various aspects of human life. For instance, in the healthcare domain, approximately 250,000 people die prematurely each year from asthma attacks, and almost all of these deaths are avoidable. Asthma patients who are in a critical situation can be notified to avoid the places that can trigger their asthma attacks. Similar applications of situation recognition lie in areas including emergency evacuations, business analytics, and epidemic detection.

Figure 1: Different types of concepts can be detected in different data availability settings. A single medium, such as images, yields concepts that exist in the images more than in the real world, whereas combining different media makes it possible to detect concepts in the real world.


At the end of the tutorial the attendees should be able to:

1. Describe the problem of situation recognition and how it is different from object detection, event recognition, scene understanding etc.

2. Outline the different interpretations of situations across different fields e.g. multimedia, ubiquitous computing, robotics, aviation etc.

3. Articulate a computational definition for the concept of “situation” and the problem of situation recognition.

4. Identify the important categories of operators needed for the task of situation recognition.

5. Relate to the practical experience of creating at least one practical situation recognition application using an open-source situation recognition toolkit.

6. Articulate the emerging trends in situation-based computing and identify the open challenges in the field of situation recognition.


1) Concept recognition from Multimedia data

As shown in Figure 1, situation recognition builds on and extends object recognition, scene recognition, activity and event recognition, and complex event processing. It pushes the envelope beyond detecting intra-media concepts (i.e. those which manifest themselves, and can be detected, within a single media object, e.g. a tree or a chair in an image) to defining and making quick progress on detecting real-world concepts (i.e. those which occur in the real world, are constantly evolving, and inherently manifest themselves over heterogeneous multimedia streams from numerous sources). As a simple example, we may now look beyond the problem of creating a tree detector and/or testing it over millions of Flickr images to that of using a stream of billions of such images and other available data to detect seasonal patterns, plant disease spreads, deforestation trends, or global warming. These problems could not be tackled earlier because of the lack of data and computational resources, but those are no longer the bottlenecks.

2) Situation recognition across multiple research domains

A large amount of work on situation modeling, situation awareness, situation calculus, situation control, and situation semantics has been done in areas like ubiquitous/pervasive computing, building automation, mobile application software, aviation/air traffic control, robotics, industrial control, military command and control, surveillance, linguistics, stock market databases, and multimodal presentation (cf. [1, 2]). The interpretation of situation, however, differs across areas and even across different works within the same area. Multiple such interpretations will be discussed, and the differences as well as the commonalities among them will be identified.

3) Situation recognition

To combine heterogeneous real-time data streams into actionable situations, one may focus on the spatiotemporal commonality across streams to integrate them. A recent approach proposed in the multimedia community uses a simple unified representation, called Emage (Event-based image), that indexes and organizes all data into a common form. Similarly, for going from individual data nuggets (micro-events) to macro-situations, it uses a set of generic spatiotemporal analysis operators [3, 4]. Other researchers have identified a variant of this representation, called Cmages (Concept-based images), which can be created from heterogeneous data and allow for information fusion and situation recognition [5]. Yet another set of researchers has defined the FraPPE (Frame, Pixel, People, Events) ontology for combining heterogeneous spatio-temporal data into actionable insights [6]. A common thread in all these approaches is the unified representation of heterogeneous data as pixels and the definition of analytics operators on them. The salient points of these approaches and the application of the defined operators for situation recognition will be discussed.

Figure 2: An Emage Showing User Interest across Mainland US in Terms of Number of Tweets Containing the Term ‘iPhone’ on 11th June 2009
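The pixel-style representation described above can be illustrated with a minimal sketch. It assumes that an Emage is simply a fixed grid over a geographic bounding box whose cell values count the observations (for example geotagged tweets) that fall inside each cell. The function name `build_emage`, its parameters, and the sample coordinates are hypothetical and are not taken from EventShop or from the cited papers.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def build_emage(points: Iterable[Point],
                lat_range: Tuple[float, float],
                lon_range: Tuple[float, float],
                rows: int,
                cols: int) -> List[List[int]]:
    """Bin (lat, lon) observations into a rows x cols grid of counts.

    Illustrative assumption: each grid cell plays the role of a pixel,
    and its value is the number of observations located in that cell.
    """
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    grid = [[0] * cols for _ in range(rows)]
    for lat, lon in points:
        # Skip observations outside the bounding box.
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue
        # Row 0 is the northernmost band, as in an image.
        r = int((lat_max - lat) / (lat_max - lat_min) * rows)
        c = int((lon - lon_min) / (lon_max - lon_min) * cols)
        # Guard against landing exactly on the southern edge.
        r = min(r, rows - 1)
        c = min(c, cols - 1)
        grid[r][c] += 1
    return grid


# Hypothetical sample: two points near New York, one near Los Angeles,
# and one outside the bounding box, which is ignored.
sample_points = [(40.7, -74.0), (40.8, -73.9), (34.0, -118.2), (60.0, 10.0)]
emage = build_emage(sample_points, (24.0, 50.0), (-125.0, -66.0), rows=2, cols=2)
```

Once every stream is binned onto the same grid, streams with very different origins become directly comparable cell by cell, which is what makes generic operators possible.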

4) Designing situation based applications

Participants will be exposed to multiple platforms that support the integration of heterogeneous multimedia data for situation recognition. The open-source EventShop framework focuses on declarative operators that work on top of the Emage representation [3]. The FraPPE model works on an ontology model and employs a human in the loop to undertake visual analytics [6]. The tutorial will discuss multiple examples, and the participants will get hands-on experience with creating one practical application: an asthma risk score for any location.
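As a rough idea of what a location-based risk score could look like, the sketch below combines two gridded layers cell by cell and then maps the result to labels. Everything in it is an assumption made for illustration: the layer names (pollen, air quality), the equal weights, the thresholds, and the function names are hypothetical and do not reproduce EventShop's operators or the model used in the tutorial.

```python
from bisect import bisect_right
from typing import List, Sequence

Grid = List[List[float]]


def normalize(grid: Grid) -> Grid:
    """Rescale a grid to [0, 1]; a constant grid maps to all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


def combine(layers: Sequence[Grid], weights: Sequence[float]) -> Grid:
    """Cell-wise weighted sum of grids that share the same shape."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [[sum(w * layer[r][c] for layer, w in zip(layers, weights))
             for c in range(cols)]
            for r in range(rows)]


def classify(grid: Grid, thresholds: Sequence[float],
             labels: Sequence[str]) -> List[List[str]]:
    """Map each cell to a label; thresholds must be sorted ascending."""
    return [[labels[bisect_right(thresholds, v)] for v in row] for row in grid]


# Hypothetical input layers on the same 2 x 2 grid.
pollen = [[0.0, 10.0], [5.0, 10.0]]
air_quality_index = [[50.0, 150.0], [100.0, 50.0]]

# Illustrative equal weights; real weights would need domain expertise.
risk = combine([normalize(pollen), normalize(air_quality_index)], [0.5, 0.5])
risk_labels = classify(risk, thresholds=[0.33, 0.66],
                       labels=["low", "medium", "high"])
```

Normalizing before combining keeps a layer with large raw values (such as an air quality index) from dominating one with small values (such as a pollen count).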

5) Future trends and open problems

There exist multiple open problems in situation recognition. These include issues of scalability, network-based models for situations, data discovery, dealing with data uncertainty, and spatio-temporal concept detection. Consequently, there are tremendous opportunities in leveraging and repurposing multimedia processing techniques (e.g. convolution, vector representation, optical flow) for the analysis of spatio-temporal gridded data and for spatio-temporal situation recognition. Further, newer trends in computational crowd-sourcing and human-in-the-loop methods for multimedia processing tasks can be used for distributed multimedia data generation and processing. This tutorial will encourage and engage multimedia researchers in tackling these problems.
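To make the idea of repurposing image operators concrete, the following sketch applies a plain 2D convolution to a small grid of counts, the same operation used to blur an image. It is a minimal pure-Python illustration that assumes zero padding and an odd-sized kernel; the function name `convolve2d` and the sample grid are hypothetical and are not taken from any of the toolkits mentioned above.

```python
from typing import List

Grid = List[List[float]]


def convolve2d(grid: Grid, kernel: Grid) -> Grid:
    """Same-size 2D convolution with zero padding (odd-sized kernel)."""
    rows, cols = len(grid), len(grid[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    # Convolution flips the kernel relative to the grid.
                    rr = r + r_off - i
                    cc = c + c_off - j
                    # Cells outside the grid count as zero.
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += grid[rr][cc] * kernel[i][j]
            out[r][c] = total
    return out


# A single hotspot of 9 observations in the middle of a 3 x 3 grid.
hotspot = [[0.0, 0.0, 0.0],
           [0.0, 9.0, 0.0],
           [0.0, 0.0, 0.0]]

# 3 x 3 box (mean) kernel: spreads each value over its neighborhood.
box_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
smoothed = convolve2d(hotspot, box_kernel)
```

On gridded spatial data, such smoothing can be read as spreading the influence of an observation to neighboring locations, for instance to compensate for sparse or noisy geotags.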


[1] V. K. Singh, “Personalized Situation Recognition”, Ph.D. Thesis (Advisor: Ramesh Jain), University of California, Irvine, 2012.

[2] Dousson, C., Gaborit, P., & Ghallab, M. (1993, August). Situation recognition: representation and algorithms. In IJCAI (Vol. 93, pp. 166-172).

[3] V. K. Singh, M. Gao and R. Jain, "Social pixels: Genesis and evaluation," in Proceedings of the International Conference on Multimedia, 2010, pp. 481-490.

[4] V. K. Singh, M. Gao and R. Jain, "Situation recognition: an evolving problem for heterogeneous dynamic big multimedia data," in Proc. Int. Conf. on Multimedia, 2012.

[5] Wang, Y., & Kankanhalli, M. S. (2015, May). Tweeting Cameras for Event Detection. In Proceedings of the 24th International Conference on World Wide Web (pp. 1231-1241). International World Wide Web Conferences Steering Committee.

[6] Balduini, M., Della Valle, E., Azzi, M., Larcher, R., Antonelli, F., & Ciuccarelli, P. (2015). Citysensing: visual story telling of city-scale events by fusing social media streams and call data records captured at places and events. IEEE MultiMedia, 22(3).

TUTORIAL 2 : On the “Face of Things”



Our face describes our identity, carries our emotions, reflects our minds, conveys much more information, and yet is constantly changing over time. For instance, by looking at the face shape or wrinkles, we can tell age; by looking at different expressions, we can tell someone is happy or sad; by checking facial features, we can infer someone’s race etc. In fact so much more information can be deduced (as if by Sherlock Holmes) from just a face, and yet a face can also be very deceiving at times.

In this tutorial, we identify and discuss the following research challenges in current Face Detection/Recognition research and related research areas:

- Unavoidable Facial Feature Alterations - our age, health status, living habits, environment, etc. all play important roles in facial feature changes (e.g., wrinkles, cloudy eyes). How do we identify faces with these unavoidable feature changes?

- Voluntary Facial Feature Alterations - the pursuit of beauty drives people to all kinds of facial variations: change of hair style, make-up, etc. How do we identify faces with these voluntary feature changes?

- Uncontrolled Environments - other than the face itself, the quality of the media also plays a crucial role in the accuracy of face identification. How do we take account of uncontrolled environment variables?

- Accuracy Control on Large-scale Datasets - the extracted facial features and the facial identification algorithms should remain accurate and scalable when applied to large-scale datasets.

Target audience and prerequisite knowledge required: beginner and intermediate level in terms of face detection, face recognition, and facial feature extraction. Exposure to and interest in these or related topics can serve as a good prerequisite.

The face is crucial to human identity, and face identification has become crucial to information security. It is important to understand and work with the problems and challenges in all the different aspects of facial feature extraction and face identification. This tutorial will outline these challenges and discuss existing research solutions.


To understand the fundamentals of facial features and face identification, and to discuss the current challenges, future trends, and various fields of application of facial feature study.


1) Fundamentals of Facial Features and Face Identification

Briefly review facial feature extraction and face identification; introduce the new challenges within the topic.

2) Challenge 1: Unavoidable Face Alterations:

Discuss different types of unavoidable face alterations, such as aging, mole growth, and scarring; analyze the various ways in which accuracy can be affected by these types of alterations; and discuss ways to handle unavoidable face alterations.

3) Challenge 2: Voluntary Face Alterations:

Discuss different types of voluntary face alterations, such as camouflage, plastic surgery, and make-up; analyze the various ways in which accuracy can be affected by these types of alterations; and discuss ways to handle voluntary face alterations.

4) Challenge 3: Uncontrolled Environments

Introduce different environmental variables and how these variables affect accuracy; discuss ways to reduce environmental interference.

5) Challenge 4: Accuracy Control in Large-scale Datasets

Explore the need for and importance of accuracy control in any face identification application on large-scale datasets (including social media datasets). Discuss current issues within applications on large-scale social media data and strategies that can help improve accuracy.

6) Applications and Conclusion

Describe different application areas (spin-offs) of facial feature studies and the associated challenges.