thesis/real_related_work.tex at master · Cotes/thesis · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
\chapter{Related work}

\emph{In this chapter, we study the state-of-the-art of Personal Cloud. First of all,
we review a brief history of the Personal Cloud term. Later, we discuss about recent
publications about measuring and benchmarking Personal Clouds. We end this chapter
explaining related work on synchronization algorithms.}

\section{A Brief History of the Personal Cloud Term}

In the past years there have been divergent views of the ``Personal Cloud'' concept.
For example, in \cite{hari2012personal} authors propose an architecture and design
for accessing and sharing computational resources in virtual machines. For them, a
Personal Cloud is a collection of Virtual Machines running on unused computers at the edge.
Another different view focuses on collaborative work \cite{ardissono2009service},
where a web infrastructure is defined to provide a unified environment
for handling activities and collaborations. Finally, a recent trend \cite{windley}
 goes further and defines the Personal Cloud as a cloud Operating
System that offers a core set of services around identity, trust, data access and
even programming models.

In this work, we focus on Personal Cloud Storage platforms that take care of data
sync and sharing from heterogeneous devices. In fact,  the term ``Personal Cloud'' have
received a lot of attention with the recent research reports from Forrester \cite{forrester}
and Gartner \cite{gartner}. Like us, these reports associate the term Personal Cloud with
online cloud storage services such as Dropbox, Box, or Google Drive among others.

\section{Measurements and benchmarks}
The performance evaluation of Cloud services is a
current hot topic with several papers appearing
recently~\cite{variability_ccgrid11, nature_data_center_traffic}. A large corpus
of works have focused on measuring distinct aspects of a Cloud service,
such as computing performance~\cite{ec2benchmarking}, service
variability~\cite{variability_ccgrid11}, and the nature of
datacenter traffic~\cite{nature_data_center_traffic}, among
others issues.

Unfortunately, only few works have turned attention to measure the performance
of Cloud storage services. To wit, the authors in~\cite{early_experiences_azure}
explore the performance of Microsoft Azure,
including storage. In this line, the authors in~\cite{s3_grids} execute an extensive
measurement against Amazon S3 to elucidate whether Cloud storage is suitable
for scientific Grids or not.
Similarly, \cite{bergen2011client} presents a performance
analysis of the Amazon Web Services, with no
insights regarding Personal Clouds.

File hosting and file sharing Cloud services have been analyzed in depth
by several works \cite{imc_rapidshare, one_click_hosting}.
They provide an interesting substrate to understand both
the behavior of users and the QoS of major providers (e.g. RapidShare).

The comparison of public Clouds have recently aroused much
interest. Concretely, the authors in \cite{cloudcmp} present CloudCmp, a systematic
performance and cost comparator of Cloud providers.
The authors validate CloudCmp by measuring the elastic computing,
storage, and networking services offered by several
public Clouds based on metrics reflecting the impact
on the performance delivered to customer applications.

In a similar fashion, the authors of \cite{hu2010good} compare Dropbox, Mozy,
Carbonite and CrashPlan backup services. However, their analysis of
the performance, reliability and security levels is rather lightweight, and
more work is needed to characterize these services with enough rigour.

Recently, the authors in \cite{drago2012inside} presented an
extensive measurement of DropBox in two scenarios: in a university
campus and in residential networks. They analyzed and characterized the
traffic transmitted by users, as well as the
functioning and architecture of the service. Instead of analyzing the behavior
of users using a specific Personal Cloud, we focused on
characterizing the service that many providers offer.

\section{Synchronization Algorithms}

At the core of Personal Cloud is file sync. Although a rush of online file sync services have
been entering the market during the last years, evidenced by the explosion in popularity of
Dropbox and competitors, little is known about the design and implementation of commercial
sync protocols. According to a recent characterization of Dropbox~\cite{drago2012inside}, file synchronization
is built upon third-party libraries such as librsync, but the role of this library is uncertain
because of the very nature of the rsync algorithm~\cite{tridgell96rsync}. Other popular tools like
unison~\cite{unison} that use the same basic algorithm suffer from the same deficiencies.

rsync is symmetric, and provides pairwise synchronization between two devices, where the rsync
utility running on each computer must have local access to the entire file. This requirement
poses the first practical limitation to the adoption of rsync because working at the file level
prevents efficient data deduplication. To save storage space and money, services like Dropbox
split files into chunks and store them at multiple nodes on the server side. A straightforward
adaptation of rsync to this context would be piecing together the chunks and reconstructing
whole file at a chosen server and then operate on it. This, unfortunately, would waste massive
intra-cluster bandwidth, deteriorating significantly deduplication efficiency. It is worth
noting here that, although rsync finds chunks of data that occur both in the old file and the
new file, it requires the side acting as a server to compute hashes for all possible alignments
in its file in order to find a match. For this, it needs the whole file.

In addition, if a single character is modified in each chunk of the old file, then no match
will be found by the server and rsync will be completely useless~\cite{langford01}. To address this limitation,
a number of single-round and multi-round protocols have been proposed in the last ten years.
Multi-round protocols allow communicating fewer bits in total by using additional communication
rounds; see~\cite{langford01} and~\cite{suel04}. However, the fact of taking multiple passes over files presents evident
disadvantages in terms of protocol complexity, computing and I/O overheads, and communication
latency. Recent single-round protocols~\cite{irmak05}\cite{hao08} bypass this difficulty by using variable-length
content-based chunking~\cite{Muthitacharoen01}. However, since these protocols only synchronize files between two
different machines at a time, they are not directly applicable to Personal Clouds, where file
changes occurring elsewhere are automatically notified to any other device sharing that file.

There is a large body of work by the OS community that attempts to detect redundancies in order
to reduce storage or transmissions costs, like LFBS~\cite{Muthitacharoen01} and Pastiche~\cite{Cox02}, among others, which
have inspired in one way or another many of today's online cloud storage services. These systems
operate at block level by relying on variable-length content-based chunking, rather than at
file level. Compared with personal cloud applications, distributed file systems pursue a
different objective and can skip implementing some basic functionality in personal clouds like
file version management or the scalable notification of updates as soon as they occur. In fact,
LBFS uses leases in which the server's obligation to inform a client of changes expires after one
minute. The result is that the client will be out of sync once a lease on a file has expired.
In Dropbox, however, any change on the central storage is advertised as soon as it is made~\cite{drago2012inside}.
For this purpose, the Dropbox client keeps continuously opened a TCP connection to a notification
server, used for receiving information about changes performed elsewhere.

Overall, \textit{providing an efficient and scalable sync protocol} poses a grand challenge for
engineers in charge of building Personal Cloud storage services, due to the intricate
relationships between deduplication, notification and metadata management.