Doctoral thesis

Česky

Title:

The Online Labeling Problem

Author:

Jan Bulánek, Ph.D.

Advisor:

Doc. Mgr. Michal Koucký, Ph.D.

Defense:

Prague, 2014

Institute:

Institute of Mathematics of the Academy of Sciences of the Czech Republic

Institute:

Department of Theoretical Computer Science and Mathematical Logic

Download

Application: Application.pdf

Thesis: Thesis.pdf

Extended abstract: ExtendedAbstract.pdf

Curriculum vitae: cv.pdf

Review of Doc. Mgr. Michal Koucký Ph.D. (advisor): AdvisorReview.pdf

Review of Prof. Gerth Stolting Brodal (referee): BrodalReview.pdf

Review of Prof. John Iacono (referee): IaconoReview.pdf

Abstract

A sorted array is a fundamental algorithmic concept. Its on-line variant gives rise to the online labeling problem. In the online labeling problem we are given an array of size m and a stream of n integers from the universe {1,..., r} coming in an arbitrary order. Our task is to maintain all received items in the array in sorted order. The inserted items do not have to be stored consecutively in the array. Since the final order of the items is not known until we see all the items, moves of already inserted items are allowed but should be minimized.

Algorithms with amortized cost O(log2(n)) per inserted item for the most important case of m=O(n) are known since the early eighties. We show that these algorithms are optimal. Algorithms for certain other ranges of m were also known previously. We extend them to virtually all possible ranges of m and establish the optimality of all these algorithms as well. All the known algorithms are oblivious to the relationship between the universe size r and the array size m. Conceivably some algorithm could benefit from r being comparable to m. We rule out this possibility for certain ranges of m in particular, for the linear case. This issue was not addressed previously. Some of our tight lower bounds also apply to randomized algorithms. For these cases our results imply that the expected time for randomized algorithms is same as for deterministic ones.

Publications

1.  J. Bulánek, M. Koucký, M. Saks
On Randomized Online Labeling with Polynomially Many Labels
International Colloquium on Automata, Languages and Programming, ICALP’13 , pp. 291-302, 2013.

2.   Z. Falt , J. Bulánek, J. Yaghob
On Parallel Sorting of Data Streams
17. Advances in Databases and Information Systems, ADBIS’ 12 , pp. 69-77, 2012.

3.   M. Babka, J. Bulánek, V. Čunát, M. Koucký, M. Saks,
On online labeling with polynomially many labels
20th Annual European Symposium on Algorithms, ESA'12, pp. 121-132, 2012.

4.   J. Bulánek, M. Koucký, M. Saks      
Tight lower bounds for online labeling problem
44th Annual ACM Symposium on Theory of Computing, STOC'12, pp. 1185-1198, 2012.

5.   J. Bulánek
Prefix vs. unordered bucketing
In preparation.

6.   J. Bulánek
A note on list coloring of squares of cycles
Manuscript.

Informal overview

The construction of effective algorithms and data structures is the central focus of computer science since its very beginning. As the amount of data we need to process increases this is an important topic even though the computational power of our devices is increasing. Therefore, we need to design the best algorithms or data structures possible. To do so we need to be able to recognize what is the best possible solution. Thus, we need to prove lower bounds on the best possible cost of algorithms and data structures.

 

In this thesis we do both. We design algorithms and prove lower bounds on their cost. For a particular problem called the online labeling problem we not only present a new algorithm which is superior to known algorithms for certain input sizes, but we also prove matching general lower bounds.

 

The online labeling problem is tightly connected to sorting. Sorting is no doubt one of the fundamental problems of computer science. We know the optimal time complexity of sorting. Sorted arrays are easy to use, provide asymptotically optimal search time and utilize memory caches well during scan query.

A natural question is, whether inserts of additional items in a sorted array can be implemented efficiently. There are immediate applications of such a structure in algorithm and data structure design, e.g., dynamic sorted arrays, priority queues, dictionaries etc.

Beyond direct algorithmic applications the problem is central also in other areas. Emek and Korman established a connection between the online labeling problem and the distributed controller problem. In distributed controller problem, nodes in an asynchronous distributed network receive requests from outside the network for units of a limited resource, and issue usage permits in response to the requests. A combination of their and our results implies the tight lower bound for distributed controller problem.

Problem description

You are given an array of size m and a stream of n integers coming in an arbitrary order where n<m. Your task is to maintain all already received items in the array in sorted order. The inserted items do not have to be stored consecutively in the array. Since the final order of the items is not known until we see all the items, moves of already inserted items are allowed but should be minimized.

Let us describe one insert of some algorithm which solves the online labeling problem. The insert of one integer consists of three steps.


Consider the following setting of the array and the stream of integers as an input (which is unknown to the algorithm) obtained after certain number of inserts. As you can see the array is sorted with some gaps between non-empty cells.

Step 1.

The algorithm is provided the next integer to be inserted. It determines the intended position of the integer in the array. Notice that there is no empty cell for number 12. Actually, it can be shown that unless m>2n or the input stream is known in advance we cannot ensure an empty cell for each integer from the input.

Step 2.

In the next step the algorithm moves (arbitrary) integers already stored in the array so that it frees some cell for the newly arrived integer. These moves are what the algorithm is charged for. In this particular example the cost of the insert will be 4 since 3 integers were moved (we do not care about the distance the integer is moved) and one item was inserted.

Step 3.

The newly arrived integer is then inserted into the array and we can continue with the next integer 14.

The obvious solution to the online labeling problem moves O(n) items after each insert with the total complexity O(n2). This, however, does not benefit from the fact that the size of the array may be significantly larger than n.

This is one of several possible formulations of online labeling problem which is also known as the file maintenance problem. In the general formulation, items from a universe U of size r are assigned labels from a given linearly ordered universe of size m. The ordering of their labels must respect the ordering of items. Relabeling of items (which is equivalent to moving items in the array) is allowed, but should be minimized. This formulation is meaningful even for m greatly exceeding n.

Known algorithms

While O(n2) algorithm is obvious and straightforward the existence of fast and simple algorithm is rather surprising. It was first given at the beginning of the eighties by Itai, Konheim and Rodeh. The amortized time complexity was O(log2(n)) per inserted item, while the size of the array can be set to cn for any constant c>1. This surprising result was achieved by a clever utilization of the empty space in the array. Since then, several other algorithms were designed. The newest algorithm by Itai and Katriel from 2007 can be actually implemented on a few lines of code. The most complicated part of their algorithm is the binary search used to determine the position of a newly inserted integer. On the other hand one has to do some non-trivial analysis to establish the correctness and complexity of their algorithm.

Our contribution

For more than three decades it remained open whether the existing algorithms (none of them beat O(log2(n)) cost per inserted item) are asymptotically optimal. Especially the new applications in so called cache oblivious algorithms renewed the interest in this problem.

In our thesis we resolve this question by proving lower bounds matching the complexity of these algorithms. Indeed, we prove optimal lower bounds for almost all possible array sizes, in the case of linear size arrays we prove such lower bounds even under the assumption of limited universe r∊O(n) while all previous results assume r≥2n, and we also prove the first lower bound for randomized algorithms for the online labeling problem.

We also present the first algorithm for arrays of size nlog n up to 2n. Since we provide a matching lower bound this algorithm is asymptotically optimal. This result is meaningful in the general labeling formulation as the bit-size of each label is logarithmic in m.

Thus, we essentially completely determine the optimal complexity of deterministic algorithms for the online labeling problem.

Further directions

There are two possible further directions how to extend our results: Are there asymptotically better randomized algorithms for the online labeling problem? We know that in the case of polynomial size arrays the answer is negative. In the other direction one can ask whether it is possible to obtain better algorithms when the size of the universe from which the items are selected is comparable to the size of the array. We know that in the case of linear size arrays the answer is negative as well but what about larger arrays?

Acknowledgments

I would like to thank to all coauthors, namely Michal Koucký, Michael Saks, Martin Babka and Vladimír Čunát without whom this thesis and the respective results could not be obtained. This text, however, expresses my personal point of view and thus it may contain mistakes as well as it may miss some important aspects of our results.