WOS4: Workshop on Storage and Processing of Big Data


Dates: December 3–4, 2014 (all day)



The fourth Inria-Technicolor Workshop on Storage and Processing of Big Data is an open event bringing together experts in distributed systems who are interested in the storage and processing of large datasets.

The event was held at Technicolor in Cesson-Sévigné on December 3rd and 4th, 2014.




Speaker List
Stéphan Clémençon (Telecom Paristech)
Aleksandar Dragojevic (Microsoft Research)
Patrick Laffitte (Kila Systems)
Anne-Cécile Orgerie (CNRS/IRISA)
Etienne Rivière (University of Neuchâtel)
Nicolas Le Scouarnec (Technicolor)
Gael Thomas (Telecom SudParis)
Patrick Valduriez (Inria Sophia-Antipolis)
Jon Walkenhorst (Technicolor)
Wednesday, December 3rd
9h30-9h45 Opening
9h45-10h30 Aleksandar Dragojevic – FaRM: Fast Remote Memory
11h-12h30 Patrick Laffitte – Enriching metadata from user behavior, at Kila Systems
Nicolas Le Scouarnec – Efficient codes for storage

14h00 – 15h30 Patrick Valduriez – CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language
Stéphan Clémençon – Survey schemes for scaling-up machine learning
16h-17h40 Student Talks
Dimitri Pertin (IRCCyN) – RozoFS: A Distributed File System based on Erasure Coding for I/O Intensive Workloads
Luis Eduardo Pineda Morales (Inria/Microsoft Research) – Multisite metadata management for geographically distributed cloud workflows
Fabien André (Technicolor) – Indexing of large scale vectors
Pierre Meye (Orange) – A secure two-phase data deduplication scheme

17h40-18h00 First day closing

19h45 Dinner in the center of Rennes, at l’Amiral

Thursday, December 4th
9h-10h30 Anne-Cécile Orgerie – Toward energy-efficient cloud computing
Jon Walkenhorst – Hollywood and the next generation of Production as a Service
11h-12h30 Etienne Rivière – Towards efficient confidentiality-preserving content-based publish/subscribe
Gael Thomas – NUMAGiC: an efficient garbage collector for NUMA machines

12h30-12h45 Workshop closing



Talks and abstracts
Stéphan Clémençon – Survey schemes for scaling-up machine learning
In certain situations, which will undoubtedly become more and more common in the Big Data era, the available datasets are so massive that computing statistics over the full sample is hardly feasible, if not infeasible. A natural approach in this context consists in using survey schemes and substituting the “full data” statistics with their counterparts based on the resulting random samples, of manageable size. The main purpose of this talk is to describe a few striking theoretical results on the impact of survey sampling on empirical risk minimization. In particular, we investigate the gain in using sampling schemes with unequal inclusion probabilities for (stochastic) gradient descent-based M-estimation methods in large-scale statistical and machine-learning problems. Precisely, we prove that, in the presence of some a priori information, one may significantly reduce the number of terms that must be averaged to estimate the gradient at each step with overwhelming probability, while preserving the asymptotic accuracy. These results are established through rate bounds and limit theorems, and are illustrated by numerical experiments.
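The gist of the unequal-inclusion-probability idea can be illustrated on a toy least-squares problem. This is only a minimal sketch, not the speaker's actual method: the choice of sampling probabilities (row norms of the design matrix), the step size, and the Horvitz-Thompson-style reweighting below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem (sizes and names are illustrative).
n, d = 10_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# A priori information: per-example gradient-norm bounds, here taken
# to be the row norms of X, define unequal inclusion probabilities.
g_bound = np.linalg.norm(X, axis=1)
p = g_bound / g_bound.sum()

def sampled_gradient(w, m=100):
    """Unbiased (Horvitz-Thompson style) estimate of the gradient of
    the full least-squares risk, averaging only m reweighted terms."""
    idx = rng.choice(n, size=m, p=p)
    per_example = (X[idx] @ w - y[idx])[:, None] * X[idx]
    weights = 1.0 / (n * p[idx])  # correct for unequal sampling
    return (weights[:, None] * per_example).mean(axis=0)

# Plain gradient descent driven by the sampled estimates.
w = np.zeros(d)
for _ in range(2000):
    w -= 0.05 * sampled_gradient(w)
```

Although each step averages only 100 of the 10,000 gradient terms, the iterate still lands close to `w_true`; quantifying precisely how far the sample size can be reduced is the kind of result the talk states via rate bounds and limit theorems.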

Aleksandar Dragojevic – FaRM: Fast Remote Memory
I will talk about the design and implementation of FaRM, a new main-memory distributed computing platform that exploits RDMA communication to improve both latency and throughput by an order of magnitude relative to state-of-the-art main-memory systems that use TCP/IP. FaRM exposes the memory of machines in the cluster as a shared address space. Applications can allocate, read, write, and free objects in the address space. They can use distributed transactions to simplify dealing with complex corner cases that do not significantly impact performance. FaRM provides good common-case performance with lock-free reads over RDMA and with support for collocating objects and function shipping to enable the use of efficient single-machine transactions. FaRM uses RDMA both to directly access data in the shared address space and for fast messaging, and is carefully tuned for the best RDMA performance. We used FaRM to build a key-value store and a graph store similar to Facebook’s. They both perform well: for example, a 20-machine cluster can perform 160 million key-value lookups per second with a latency of 31 microseconds.

Patrick Laffitte – Enriching metadata from user behavior, at Kila Systems

Anne-Cécile Orgerie – Toward energy-efficient cloud computing
For the last few years, Cloud computing has emerged as an attractive technology leading towards new Internet usages. Its rapid expansion led to a fast growth of the number of datacenters used to power these non-virtual Clouds. Along with this growth comes a consequent increase in the electricity used by these systems. As this rising energy consumption is concerning and hampering Cloud expansion, energy-efficient techniques have been developed. In this talk, I will present the different techniques used to make Clouds greener at each level: from the hardware and infrastructure layers to the platform and software layers. Finally, I will propose some validations of these techniques and some ideas to further improve energy-efficiency.

Etienne Rivière – Towards efficient confidentiality-preserving content-based publish/subscribe
Content-based publish/subscribe provides a loosely coupled and expressive form of communication for large-scale distributed systems. Confidentiality is a major challenge for publish/subscribe middleware deployed over multiple administrative domains. Recent advances in encrypted processing have led to the design of privacy-preserving content-based filtering techniques. These techniques allow matching (encrypted) subscriptions against incoming (encrypted) publications without revealing their content to the brokers. I will give an overview of a state-of-the-art technique, ASPE. Encrypted filtering imposes high performance penalties compared to regular, plaintext filtering. I will present recent work on prefiltering techniques that bridge this performance gap by augmenting encrypted subscriptions and publications with compact structures. The prefiltering operator greatly reduces the number of encrypted subscriptions that must be matched against incoming encrypted publications, and can take advantage of containment between subscriptions when available. Our security analysis indicates that the robustness of the base scheme is preserved, while our performance evaluation indicates that the cost of filtering is greatly reduced. Finally, we will consider a practical aspect of the deployment of confidentiality-preserving publish/subscribe using encrypted filtering, the problem of key updates, and present an innovative approach allowing direct re-encryption of subscriptions in untrusted domains.
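As a rough illustration of the prefiltering idea (not the scheme from the talk: the Bloom-filter-style construction, the token names, and the parameters below are assumptions made for this sketch), a compact bit vector attached to each subscription and publication lets the broker discard most non-matching pairs before any costly encrypted matching:

```python
import hashlib

M = 256  # size in bits of the compact prefilter structure

def positions(token, k=3):
    """k bit positions for a token. A real scheme would use keyed
    hashing to limit what an untrusted broker learns from the bits."""
    h = hashlib.sha256(token.encode()).digest()
    return [int.from_bytes(h[3 * i:3 * i + 3], "big") % M for i in range(k)]

def prefilter_bits(tokens):
    """Bloom-filter-style summary of a set of attribute tokens."""
    bits = 0
    for t in tokens:
        for pos in positions(t):
            bits |= 1 << pos
    return bits

def may_match(sub_bits, pub_bits):
    """A subscription can only match a publication if all of its bits
    are set in the publication's summary: false positives are possible,
    false negatives are not, so no genuine match is ever discarded."""
    return sub_bits & pub_bits == sub_bits

# The cheap bitwise test prunes candidates; only survivors go through
# the expensive encrypted (ASPE-style) matching step.
pub = prefilter_bits(["topic:storage", "size:large"])
sub = prefilter_bits(["topic:storage"])
candidates = [s for s in (sub,) if may_match(s, pub)]
```

The design rests on the one-sided error of the filter: a false positive only costs one wasted encrypted match, while the guaranteed absence of false negatives preserves the correctness of the underlying scheme.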

Gael Thomas – NUMAGiC: an efficient garbage collector for NUMA machines

Patrick Valduriez – CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language
The blooming of different cloud data management infrastructures, specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. The CoherentPaaS project addresses this problem, by providing a common programming language and holistic coherence across different cloud data stores. In this talk, I will present the design of a Cloud Multi-datastore Query Language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store’s native query interface. Thus, CloudMdsQL unifies a quite diverse set of data management technologies while preserving the expressivity of their local query languages. Our experimental validation, with three data stores (graph, document and relational) and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatabase query language.

Jon Walkenhorst – Hollywood and the next generation of Production as a Service
For the last 100 years, Technicolor has played a critical role in the technical evolution of movie production, from the invention of new cameras to color film and film development, before branching into editing and distribution. By the late 2000s, the use of film had given way to an all-digital process from camera to projector. Today, film production touches dozens of teams and companies as the process goes from on-location shooting to distribution and, eventually, archive. With the growing availability and capabilities of the cloud, a new workflow is taking shape. As service companies migrate their special sauce from proprietary manual processes to software-based tools, a new software-as-a-service industry is emerging. When a series of SaaS offerings is glued together in the cloud, we see the beginning of a new service industry, one driven by talent and experience rather than by old-school processes. This new world of production in the cloud is becoming the next generation of Production as a Service, and will allow a whole new generation of filmmakers to take advantage of production tools previously only available to large studios and productions.


Student Talks and abstracts

Fabien André – Indexing of large scale vectors

Pierre Meye – A secure two-phase data deduplication scheme
A study by IDC (“The digital universe decade. Are you ready?”) reports that data grows at the impressive rate of 50% per year, and that 75% of the digital world is a duplicate. Although keeping multiple copies of data is necessary to guarantee their availability and long-term durability, in many situations the amount of data redundancy is excessive. By keeping a single copy of repeated data, data deduplication is considered one of the most promising solutions to reduce storage costs, improve users’ experience by saving network bandwidth, and reduce backup time. However, several works have recently revealed important security issues leading to information leakage to malicious clients. These security concerns arise especially in cloud storage systems performing inter-user, client-side deduplication, which is unfortunately the kind of deduplication that provides the best savings in terms of network bandwidth and storage space. In this presentation, we present a simple solution that is secure against attacks from malicious clients based on the manipulation of data identifiers, as well as those based on backup time and network traffic observation.
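A minimal sketch of why client-side deduplication needs more than an identifier check: the generic two-phase proof-of-possession construction below is assumed for illustration and is not the exact scheme presented in the talk (class and function names are invented).

```python
import hashlib
import secrets

class DedupServer:
    """Toy storage server: a file identifier alone never yields
    ownership; the client must also pass a proof-of-possession
    challenge before the upload is skipped."""

    def __init__(self):
        self.store = {}  # file_id -> content

    def check(self, file_id):
        """Phase 1: if a copy already exists, return a fresh nonce."""
        return secrets.token_bytes(16) if file_id in self.store else None

    def claim(self, file_id, nonce, proof):
        """Phase 2: grant deduplicated ownership only if the client
        demonstrably holds the full content, not just its hash."""
        expected = hashlib.sha256(nonce + self.store[file_id]).digest()
        return secrets.compare_digest(proof, expected)

    def upload(self, file_id, content):
        self.store[file_id] = content

def client_backup(server, content):
    file_id = hashlib.sha256(content).hexdigest()
    nonce = server.check(file_id)
    if nonce is not None:
        proof = hashlib.sha256(nonce + content).digest()
        if server.claim(file_id, nonce, proof):
            return "deduplicated"  # nothing sent over the network
    server.upload(file_id, content)
    return "uploaded"
```

A malicious client who learned only a file's identifier cannot answer the fresh challenge, so identifier manipulation no longer yields other users' data; defeating backup-time and traffic-observation attacks requires further measures, which the presented scheme also addresses.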

Luis Eduardo Pineda Morales – Multisite metadata management for geographically distributed cloud workflows
Since scientific data can reach huge sizes and needs to support fine-grained data striping, metadata becomes a critical issue. Many distributed file systems reach their limits because they use a centralized metadata management scheme. Thus, we argue for a new, cloud-based, distributed approach. Our goal is to design and evaluate different approaches to geographically distributed metadata management, in order to provide a uniform metadata handling tool for scientific workflow engines across cloud datacenters, and ultimately to derive a cost model offering users the best trade-off (performance vs. cost) driven by their constraints.

Dimitri Pertin  – RozoFS: A Distributed File System based on Erasure Coding for I/O Intensive Workloads

Conditions and location:

Registration is free but mandatory (see below)

Contact: wosmanagement@technicolor.com




975 avenue des champs blancs
35576 Cesson Sévigné, France


Bus lines 1, 35, and 50 (5 min walk)

From city center of Rennes, two options:

  • take bus line 1 to its terminus “Champs Blancs”; the Technicolor building is just in front of the bus stop. Line 1 also stops at the train station in Rennes city center.
  • take bus line 50 at “République” and get off at “Château de Vaux”. Then walk toward the roundabout and, just before reaching it, turn left toward the Technicolor building (5 min walk in total).