CIKM 2015 is now completed. See you in Indianapolis in 2016!

Workshops & Tutorials

We are happy to announce that the following workshops and tutorials will be held at CIKM 2015.


Monday 19th October

  • PIKM 2015

    The 8th Ph.D. Workshop in Information and Knowledge Management
    Full day Workshop
    Mouna Kacimi - Free University of Bozen-Bolzano, Italy
    Nicoleta Preda - University of Versailles, France
    Maya Ramanath - Indian Institute of Technology, India

  • HetUM

    User Modeling in Heterogeneous Search Environments
    Full day Workshop
    Aleksandr Chuklin - University of Amsterdam, The Netherlands & Google Switzerland
    Yiqun Liu - Tsinghua University, China
    Ilya Markov - University of Amsterdam, The Netherlands
    Maarten de Rijke - University of Amsterdam, The Netherlands

  • TM 2015

    Topic Models: Post-Processing and Applications
    Full day Workshop
    Nikolaos Aletras - University College London, UK
    Jey Han Lau - King's College London, UK
    Timothy Baldwin - The University of Melbourne, Australia
    Mark Stevenson - University of Sheffield, UK

Friday 23rd October


Monday 19th October

  • Information Seeking, Search and Retrieval: Building and using formal Models of search and search behaviour

    Full day
    Leif Azzopardi - University of Glasgow
    Guido Zuccon - Queensland University of Technology

    This full day tutorial focuses on explaining and building formal models of Information Seeking and Retrieval. The tutorial is structured into four sessions. In the first session we will discuss the rationale of modelling and examine a number of early formal models of search (including early cost models and the Probability Ranking Principle). Then we will examine more contemporary formal models (including Information Foraging Theory, the Interactive Probability Ranking Principle, and Search Economic Theory). The focus will be on the insights and intuitions that we can glean from the math behind these models. The latter sessions will be dedicated to building models that optimise the particular objectives driving how users make decisions (i.e. a how-to guide on model building), and will then describe different techniques (including analytical, graphical and computational) that can be used to generate hypotheses from such models. In the final session, participants will be challenged to develop a simple model of interaction applying the techniques learnt during the day, before concluding with an overview of challenges and future directions.

  • Large Scale Distributed Data Science using Apache Spark

    Full day
    Dr. James G. Shanahan - NativeX, University of California, Berkeley
    Liang Dai - NativeX, University of California Santa Cruz

    Apache Spark is an open-source cluster computing framework. It has emerged as the next generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications); it is much easier to program in, thanks to its rich APIs in Python, Java, Scala (and R) and its core data abstraction, the distributed data frame; and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing.

    This tutorial will provide an accessible introduction for those not already familiar with Spark and its potential to revolutionize academic and commercial data science practices. It is divided into two parts: the first part will introduce fundamental Spark concepts, including Spark Core, data frames, the Spark Shell, Spark Streaming, Spark SQL, MLlib, and more; the second part will focus on hands-on algorithmic design and development with Spark (developing algorithms from scratch, such as decision tree learning, graph processing algorithms such as PageRank and shortest path, and gradient descent algorithms such as support vector machines and matrix factorization). Industrial applications and deployments of Spark will also be presented. Example code will be made available in Python (PySpark) notebooks.

  • VC-Dimension and Rademacher Averages: From Statistical Learning Theory to Sampling Algorithms

    Matteo Riondato - Two Sigma Investments
    Eli Upfal - Brown University

    Rademacher Averages and the Vapnik-Chervonenkis dimension are fundamental concepts from statistical learning theory. They allow the study of simultaneous deviation bounds of empirical averages from their expectations for classes of functions, by considering properties of the problem, of the dataset, and of the sampling process. In this tutorial, we survey the use of Rademacher Averages and the VC-dimension for developing sampling-based algorithms for graph analysis and pattern mining. We start from their theoretical foundations at the core of machine learning, then show a generic recipe for formulating data mining problems in a way that allows using these concepts in the analysis of efficient randomized algorithms for those problems. Finally, we show examples of the application of the recipe to graph problems (connectivity, shortest paths, betweenness centrality) and pattern mining. Our goal is to expose the usefulness of these techniques for the data mining researcher, and to encourage research in the area.

  • Scalability and Efficiency Challenges in Large-Scale Web Search Engines

    B. Barla Cambazoglu - Yahoo Labs
    Ricardo Baeza-Yates - Yahoo Labs

    Commercial web search engines need to process thousands of queries every second and provide responses to user queries within a few hundred milliseconds. As a consequence of these tight performance constraints, search engines construct and maintain very large computing infrastructures for crawling the Web, indexing discovered pages, and processing user queries. The scalability and efficiency of these infrastructures require careful performance optimizations in every major component of the search engine. This tutorial aims to provide a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. In particular, the tutorial provides an in-depth architectural overview of a web search engine, mainly focusing on the web crawling, indexing, and query processing components. The scalability and efficiency issues encountered in these components are presented at four different granularities: at the level of a single computer, a cluster of computers, a single data center, and a multi-center search engine. The tutorial also points to open research problems and provides recommendations to researchers who are new to the field.

  • Veracity of Big Data: From Truth Discovery Computation Algorithms to Models of Misinformation Dynamics

    Laure Berti-Equille - Qatar Computing Research Institute
    Javier Borge-Holthoefer - Qatar Computing Research Institute

    The evolution of the Web from a technology platform to a social ecosystem has resulted in unprecedented data volumes being continuously generated, exchanged, and consumed. User-generated content on the Web is massive, highly dynamic, and characterized by a combination of factual data and opinion data. False information, rumors, and fake content can easily spread across multiple sources, making it hard to distinguish between what is true and what is not. Truth discovery, also called fact-checking, has recently gained a lot of interest in Data Science communities. Ascertaining the veracity of data and understanding the dynamics of misinformation on the Web are two inter-dependent challenges for researchers and practitioners in Databases, Information Retrieval, and Knowledge Management.

    This tutorial explores the progress that has been made in discovering truth, checking facts, and modeling the propagation of falsified and distorted information in the context of Big Data. We will review in detail current models, algorithms, and techniques proposed by various research communities in Complex System Modeling, Data Management, and Knowledge Discovery, for ascertaining the veracity of data in a dynamic world. Finally, this tutorial will identify a wide range of open problems and research directions for discovering truth from falsehood(s) in Web data and understanding the evolution and propagation of information source trustworthiness.

  • Data Analytics on Social Media & Social Networks

    A/Prof Dr Xue Li - School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia

    Social media and networks are a popular place for people to express their opinions about consumer products, to organize or initiate social events, or to spread news. Understanding social media and social networks raises several questions: How can we detect and predict emerging sensitive events? How can we predict the propagation patterns of online micro-blogs? How can we understand people's opinions about a current issue, a new product, or an important event? This tutorial presents recent research progress on data analytics on social media and social networks. A few application systems and relevant algorithms will be presented for answering the above questions.

    Topics of the tutorial

    1. An introduction to social network mining
    2. Social network data crawling
    3. Sentiment analysis and opinion mining on social networks
    4. Location-aware emerging event detection on social networks
    5. Invariant event detection on social networks
    6. Cyberbullying detection on social networks
    7. Extraction of hot features of social objects on social networks
    8. Big data fusion with social media data

  • Indoor Data Management

    Hua Lu - Aalborg University, Denmark
    Muhammad Aamir Cheema - Monash University, Australia

    A large part of modern life is lived indoors, in places such as homes, offices, shopping malls, universities, libraries and airports. However, almost all of the existing location-based services (LBS) have been designed only for outdoor space. This is mainly because the global positioning system (GPS) and other positioning technologies cannot accurately identify locations in indoor venues. Some recent initiatives have started to cross this technical barrier, promising huge future opportunities for research organisations, government agencies, technology giants, and enterprising start-ups to exploit the potential of indoor LBS. Consequently, indoor data management has gained significant research attention in the past few years and the research interest is expected to surge in the upcoming years. This will result in a broad range of indoor applications including emergency services, public services, in-store advertising, shopping, tracking, guided tours, and much more. In this tutorial, we first highlight the importance of indoor data management and the unique challenges that need to be addressed. Then, we provide an overview of the existing research in indoor data management. Finally, we discuss future research directions in this important and growing research area.

  • Distance-based Multimedia Indexing

    Christian Beecks - RWTH Aachen University
    Merih Seran Uysal - RWTH Aachen University
    Thomas Seidl - RWTH Aachen University

    In this tutorial, we aim at providing a unified and comprehensive overview of the state-of-the-art approaches to distance-based multimedia indexing. We intend to cover a broad target audience starting from beginners to experts in the domain of distance-based similarity search in multimedia databases and adjacent research fields which utilize distance-based approaches. No prerequisite knowledge is needed.

    We begin by outlining different approaches to object representations, including the feature extraction process and suitable feature representation models as well as clustering-based computations, in order to answer the question of how to model multimedia data objects in a compact and generic way. In the second part of this tutorial, we present state-of-the-art similarity and dissimilarity measures, including kernels and distance functions, in order to complete our understanding of a similarity model. The third part is devoted to approaches for efficient query processing. After introducing similarity queries, we show how to process such queries efficiently by means of multi-step filter-and-refinement algorithms and lower bounding. The last part covers indexing approaches for distance-based similarity models, where we discuss the fundamentals of spatial indexing, high-dimensional indexing, as well as metric and Ptolemaic indexing.

Friday 23rd October

  • Algorithm Design for MapReduce and Beyond

    Sergei Vassilvitskii - Google
    Grigory Yaroslavtsev - University of Pennsylvania

    The MapReduce style of parallel processing has made certain operations nearly trivial to parallelize - Word Count is the canonical "Hello World" example. Still, parallelization of many problems, e.g., computing a good clustering, or counting the number of triangles in a graph, requires effort, since straightforward approaches yield almost no speedups over a single-machine implementation. This tutorial will cover recent results on algorithm design for MapReduce and other modern parallel architectures. We begin with an overview of the framework, and highlight the challenge of avoiding communication and computational bottlenecks. We then introduce a toolkit of algorithmic strategies for dealing with large datasets using MapReduce. The goal of most of these approaches is to reduce the data size (from petabytes and terabytes to gigabytes and megabytes), while preserving the structure relevant to the problem of interest. Sketching, composable coresets, and adaptive sampling all fall into this category of approaches. We then turn to specific applications, both to showcase these techniques and to highlight recently developed practical methods. Our initial focus is on clustering, whose many variants form the core of data analysis. We cover the classic clustering methods, such as k-means, as well as more modern approaches like correlation clustering and hierarchical clustering. We then turn to methods for graph analysis, building up our intuition with algorithms for graph connectivity and moving on to graph decompositions, matchings, spanning trees and subgraph counting.
