Multimedia items account for an important share of the data distributed and searched for on the Internet. In particular, geographic queries for tourist locations represent a substantial portion of users’ queries. Current video and photo search technology relies mainly on text information to provide users with accurate results for their queries. Retrieval capabilities are, however, still below the actual needs of the common user, mainly because of the limitations of the content descriptors. For example, textual tags tend to be noisy or inaccurate (e.g., people may tag an entire collection with a single tag), automatic visual descriptors fail to provide a high-level understanding of the scene, and GPS coordinates capture the position of the photographer and not necessarily the position of the location being queried.
Until recently, research focused mainly on improving the relevance of the results. However, an efficient information retrieval system should also be able to summarize and rank search results so that it surfaces results that are both relevant and cover different aspects of a query (e.g., providing different views of London Bridge rather than duplicates of the same perspective). In this work we introduce a novel framework that addresses this emerging area of information retrieval, with the aim of improving both the relevance and the diversification of search results, with an explicit focus on the social media context. This work is intended to support related areas of machine analysis, human-based computation (e.g., crowdsourcing), and hybrid approaches (e.g., relevance feedback, machine-crowd integration).
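To make the trade-off between relevance and diversification concrete, the following is a minimal sketch of a greedy re-ranking step in the style of maximal marginal relevance. It is an illustration only and not the method of the proposed framework: the function names, the `trade_off` parameter, and the use of cosine similarity over feature vectors are all assumptions introduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_diverse(relevance, features, k, trade_off=0.7):
    """Greedily pick k item indices, balancing relevance against redundancy.

    relevance: one relevance score per item (hypothetical, e.g. from text matching).
    features:  one feature vector per item (hypothetical, e.g. a visual descriptor).
    trade_off: 1.0 ranks by relevance only; lower values favor diversity.
    """
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance[i] - (1.0 - trade_off) * redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate views and one distinct view, a lower `trade_off` makes the second pick skip the duplicate in favor of the distinct view, while `trade_off=1.0` reduces to a plain relevance ranking.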
The proposed framework divides the multimedia data into two data streams: 1) text and 2) visual data. Text analysis is considered from two perspectives: lexical analysis, for retrieving the desired content, and sentiment analysis, for determining the attitude of the writer (of video descriptions, tags, etc.) toward a topic, or a document’s overall contextual polarity. This attitude reflects the writer’s evaluation, the writer’s emotional state while writing, and the impact on readers. Appraisal theory holds that the human cognitive process is complex: people appraise events against various criteria, and their feelings and emotions follow from those appraisals. Extracting a writer’s attitude is analogous to this view from psychology. Human attention models are efficient methods for affective content extraction, and viewer attention is based on visual perception. Next, an aggregated attention curve is generated by an intra- and inter-modality fusion mechanism. Finally, the relevant and diverse content is extracted, taking into account the sentiment and objectiveness of the user’s query. The fusion of these multimedia streams provides a bridge that links the digital representation of multimedia with the user’s perceptions. The proposed system could make searching more convenient for users and tourists and reduce the limitations on finding desired tourist locations and information.
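The two-level fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the fusion mechanism of the framework itself: it assumes that each modality yields one or more equal-length attention curves, that curves are min-max normalized before fusion, and that fusion is a weighted average. The function names and the `modality_weights` parameter are hypothetical.

```python
def min_max_normalize(curve):
    """Rescale a curve to [0, 1]; a constant curve maps to all zeros."""
    lo, hi = min(curve), max(curve)
    if hi == lo:
        return [0.0 for _ in curve]
    return [(v - lo) / (hi - lo) for v in curve]


def fuse(curves, weights=None):
    """Weighted average of equal-length curves, each normalized first."""
    if not curves:
        raise ValueError("at least one curve is required")
    length = len(curves[0])
    if any(len(c) != length for c in curves):
        raise ValueError("all curves must have the same length")
    if weights is None:
        weights = [1.0] * len(curves)
    total = sum(weights)
    normalized = [min_max_normalize(c) for c in curves]
    return [
        sum(w * c[t] for w, c in zip(weights, normalized)) / total
        for t in range(length)
    ]


def aggregated_attention(modalities, modality_weights=None):
    """Fuse curves within each modality, then fuse across modalities.

    modalities: dict mapping a modality name (e.g. "text", "visual") to a
                list of attention curves for that modality (assumed input).
    modality_weights: optional dict of per-modality weights (assumed).
    """
    names = sorted(modalities)
    # Intra-modality fusion: one curve per modality, equally weighted.
    per_modality = [fuse(modalities[name]) for name in names]
    # Inter-modality fusion: per-modality curves are re-normalized by fuse().
    weights = None
    if modality_weights is not None:
        weights = [modality_weights[name] for name in names]
    return fuse(per_modality, weights)
```

In this sketch the weights are fixed by hand; how the framework actually weights the text and visual streams is not specified here.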