Scientists Find Optimal Balance of Data Storage and Time

Seventy years after the invention of a data structure called a hash table, theoreticians have found the most efficient possible configuration for it.

Allison Li for Quanta Magazine

Introduction

About 70 years ago, an engineer at IBM named Hans Peter Luhn quietly changed the course of computer science. Luhn already held several patents, including one for a device that could measure a cloth’s thread count and another for a guide that determined what mixed drinks you could make from the ingredients in your kitchen. But in a 1953 internal IBM paper, he proposed a new technique for storing and retrieving information that is now built into just about all computational systems: the hash table.

Hash tables are a major class of data structures. They offer an especially convenient method for accessing and altering information in massive databases. But this technology comes with an unavoidable trade-off.

In a 1957 paper published in the IBM Journal of Research and Development, W. Wesley Peterson identified the main technical challenge that hash tables pose: They need to be fast, meaning that they can quickly retrieve the necessary information. But they also need to be compact, using as little memory as possible. These twin objectives are fundamentally at odds. Accessing and modifying a database can be done more quickly when the hash table has more memory, and operations become slower in hash tables that use less space. Ever since Peterson laid out this challenge, researchers have tried to find the best balance between time and space.

Computer scientists have now mathematically proved that they have found the optimal trade-off. The solution came from a pair of recent papers that complemented each other. “These papers resolve the long-standing open question about the best possible space-time trade-offs, yielding deeply surprising results that I expect will have a significant impact for many years to come,” said Michael Mitzenmacher, a computer scientist at Harvard University who was not involved in either study.

“I would definitely say it is a big deal,” added Rasmus Pagh, a computer scientist at the University of Copenhagen. “A lot of people have worked on this problem, trying to see how much you can squeeze space, while also having time-efficient operations. This is the one I would have loved to solve.”

Making a Hash of It

Hash tables are among the oldest, simplest, fastest and most widely used data structures today. They’re designed to perform three basic operations: insertions, which add new items to the database; queries, which access an item or check to see whether it exists; and deletions. A hash table can be ephemeral — existing only as long as a particular program runs — or it can be a permanent part of your computer’s operating system. A web browser such as Chrome or Safari may have multiple built-in hash tables intended to keep track of different kinds of data.

Entries in a hash table are stored as pairs, with the item — the information itself — connected to a key that identifies the information. Plug a key into a hash table’s query algorithm, and it takes you directly to the item. This may not sound so extraordinary, but for enormous databases it can be a great time-saver.
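
To make those operations concrete, here is a minimal sketch in Python using the language's built-in dict, which is itself backed by a hash table; the phone-book keys and numbers are invented for illustration.

```python
# A minimal sketch of the three basic operations, using Python's built-in
# dict, which is itself backed by a hash table. The phone-book entries
# below are invented for illustration.

table = {}

# Insertion: add new key-item pairs to the table.
table["alice"] = "555-0101"
table["bob"] = "555-0199"

# Query: plug in a key to go straight to its item, or check whether it exists.
print(table["alice"])        # -> 555-0101
print("carol" in table)      # -> False

# Deletion: remove a pair by its key.
del table["bob"]
print("bob" in table)        # -> False
```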

In 1953, Hans Peter Luhn suggested a new way to store and retrieve information called the hash table.

Courtesy of IBM Corporation

To take an extremely simplified example, consider the Oxford English Dictionary, which has definitions for more than 600,000 words. If a digital edition relies on a hash table, you can simply use a given word as a key and proceed straight to the definition. Without a hash table, the dictionary would likely rely on a much slower search mechanism, using a process of elimination to eventually converge on the requested definition. And while a hash table can find any word in a constant amount of time (usually a tiny fraction of a second), the search time for other methods can go up as the number of words in the dictionary increases. A hash table also offers another advantage: It can keep the dictionary dynamic, making it easy to insert new words and delete outdated ones.
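
The contrast is easy to sketch: a linear scan has to check entries one by one, while a hash lookup goes straight to the item. The two-entry word list below is only a stand-in for a full dictionary.

```python
# An illustrative contrast between a linear scan and a hash-table lookup.
# The two entries below stand in for a dictionary of 600,000 words.

entries = [("aardvark", "a burrowing, nocturnal mammal"),
           ("zymurgy", "the chemistry of fermentation")]

def linear_lookup(word):
    # Without a hash table: check entries one by one. The work grows
    # with the number of words stored.
    for key, definition in entries:
        if key == word:
            return definition
    return None

# With a hash table: the lookup goes (essentially) straight to the
# definition, no matter how many words the dictionary holds.
dictionary = dict(entries)

print(linear_lookup("zymurgy"))   # scans every entry that comes before it
print(dictionary["zymurgy"])      # a single hashed lookup
```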

Researchers have spent decades building hash tables that attempt to maximize speed and minimize memory. In the 20th century, solutions tended to offer significant gains in just one aspect, time or space. Then in 2003, researchers showed that it was theoretically possible to make a major efficiency leap in both time and space simultaneously. It would take another two decades, however, for researchers to figure out the ideal balance between the two.

The Data Shuffle

The first major step toward that goal came in 2022 at a major computer science conference in Rome. There, a team proposed a hash table with new features that could deliver the best combination of time and space efficiency yet conceived. The first author of the paper (listed alphabetically) was Michael Bender of Stony Brook University, so it is commonly referred to as the Bender et al. hash table. While the team didn’t try to build a functioning hash table, they proved that it could, in principle, be constructed with the features they described.

To evaluate the hash table they came up with, the group produced a trade-off curve — a graph that plots the time per operation (insertion or deletion) on one axis and the space taken up by memory on the other. But this graph defines space in a special way: Because of how they’re built, hash tables need more memory than just the bare minimum required to store a given set of items. Computer scientists call this extra space “wasted bits,” even though they’re not really wasted and are, to some extent, necessary. The space axis on a trade-off curve measures the number of wasted bits per key.

By analyzing a trade-off curve, researchers can figure out the fastest time possible for a hash table that uses a given amount of space. They can also flip the question around to figure out the smallest possible space for a given operation time. Usually, a small change in one variable will lead to a small change in the other, said William Kuszmaul, a theoretical computer scientist at Harvard and a co-author of the 2022 paper. “If you double the time, perhaps you’ll halve the number of wasted bits per key.”

But that’s not the case with the hash table they designed. “If you increase the time by a little bit, the wasted bits per key decrease exponentially,” Kuszmaul said. The trade-off curve was so steep, it was literally off the charts.
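
To get a feel for what a steep curve means, the sketch below uses made-up numbers rather than the formulas from the Bender et al. paper: on an ordinary curve, doubling the time roughly halves the wasted bits per key, while on a steep one a small increase in time shrinks them exponentially.

```python
# Purely illustrative numbers, not the formulas from the Bender et al. paper:
# a sketch of what an ordinary trade-off curve looks like next to a steep one.

times = [1, 2, 3, 4, 5, 6]   # hypothetical time budget per operation

# "Typical" trade-off: doubling the time roughly halves the wasted bits per key.
typical = [64 / t for t in times]

# Steep trade-off: a small increase in time shrinks the wasted bits exponentially.
steep = [64 / 2 ** t for t in times]

for t, a, b in zip(times, typical, steep):
    print(f"time {t}: typical ~{a:5.1f} wasted bits/key, steep ~{b:.3f}")
```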

William Kuszmaul helped design a hash table in 2022 that seemed incredibly efficient in both memory and speed.

Rose Silver

The team built their hash table in two parts. They had a primary data structure, in which the items are stored without any wasted bits at all, and a secondary data structure, which helps a query request find the item it’s looking for. While the group did not invent the notion of a secondary data structure, they did make a crucial discovery that made their hyperefficient hash table possible: The structure’s overall memory efficiency depends on how the primary structure arranges its stored items.

The basic idea is that every item in the primary structure has preferred storage locations — a best location, a second-best one, a third best and so on. If an item is in its best spot, the number 1 is affixed to it, and that number is stored in the secondary data structure. In response to a query, the secondary structure provides just the number 1, which spells out the item’s exact location in the primary structure.

If the item is in its 100th-best spot, the secondary data structure attaches the number 100. And because the system uses binary, it represents the number 100 as 1100100. It takes more memory, of course, to store the number 1100100 than 1 — the number assigned to an item when it’s in the best spot. Differences like that become significant if you’re storing, say, a million items.

So the team realized that if you continually shift items in the primary data structure into their more preferred locations, you could significantly reduce the memory consumed by the secondary structure without having to increase query times.
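
The bookkeeping can be sketched with made-up ranks; the real construction is far more involved, but the accounting is the same, since smaller ranks take fewer binary digits to store in the secondary structure.

```python
# A rough sketch of the bookkeeping, with made-up ranks. The secondary
# structure records each item's rank: 1 for its best spot, 100 for its
# 100th-best spot (written 1100100 in binary), and so on.

def bits_needed(rank):
    # How many binary digits it takes to write the rank (100 -> 7 bits).
    return rank.bit_length()

before = [100, 73, 58, 91]   # items sitting in poorly ranked spots
after = [1, 2, 1, 3]         # the same items after shifting them toward better spots

print(sum(bits_needed(r) for r in before))   # 27 bits in the secondary structure
print(sum(bits_needed(r) for r in after))    # 6 bits after the shuffle
```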

“Before this work, no one had realized you could further compress the data structure by moving information around,” Pagh said. “That was the big insight of the Bender paper.”

The authors showed that their invention established a new upper bound for the most efficient hash tables, meaning that it was the best data structure yet devised in terms of both time and space efficiency. But the possibility remained that someone else might do even better.

Bound to Succeed

The next year, a team led by Huacheng Yu, a computer scientist at Princeton University, tried to improve the Bender team’s hash table. “We worked really hard and couldn’t do it,” said Renfei Zhou, a student at Tsinghua University in Beijing and a member of Yu’s team. “That’s when we suspected that their upper bound was [also] a lower bound” — the best that can possibly be achieved. “When the upper bound equals the lower bound, the game is over, and you have your answer.” No matter how clever you are, no hash table can do any better.

Yu’s team employed a novel strategy to find out if that hunch was correct by calculating a lower bound from first principles. First, they reasoned that to perform an insertion or a deletion, a hash table — or, really, any data structure — must access the computer’s memory some number of times. If they could figure out the minimum number of times needed for a space-efficient hash table, they could multiply that by the time required per access (a constant), giving them a lower bound on the runtime.
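
The accounting behind that step can be written out with invented numbers, since the actual quantities from the proof are not needed to see how the bound is assembled.

```python
# The accounting behind the lower bound, with invented numbers. If every
# operation must touch memory at least `min_accesses` times, and each access
# costs a fixed amount of time, no hash table can beat their product.

min_accesses = 3           # hypothetical minimum memory accesses per operation
time_per_access = 1e-7     # hypothetical constant cost of one access, in seconds

runtime_lower_bound = min_accesses * time_per_access
print(runtime_lower_bound)  # any space-efficient hash table needs at least this long
```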

But if they didn’t know anything about the hash table (except that it was space-efficient), how could the researchers figure out the minimum number of times required to access the memory? They derived it purely from theory, using a seemingly unrelated field called the theory of communication complexity, which studies how many bits are required to convey information between two parties. Eventually, the team succeeded: They figured out how many times a data structure must access its memory per operation.

Renfei Zhou was part of a team that proved the 2022 hash table was as efficient as any data structure could possibly be.

Wei Yu

This was their key achievement. They were then able to establish a lower bound on the runtime for any space-efficient hash table. And they saw that it matched the Bender hash table exactly. “We thought [at first] it could be improved,” Zhou said. “It turned out we were wrong.” That, in turn, meant that Peterson’s problem had finally been solved.

Besides answering the decades-old question, Kuszmaul said, the amazing thing about the Yu proof is its generality. “Their lower bound applies to all possible data structures, including ones that have not been invented yet.” That means no method of data storage can ever beat the Bender hash table in terms of memory and speed.

Hashing Into the Future

Despite the new hash table’s unprecedented efficiency, no one is likely to try building it anytime soon. It’s just too complicated to construct. “An algorithm that is fast in theory is not necessarily fast in practice,” Zhou said.

It’s not unusual for such gaps between theory and practice to persist for a long while, Kuszmaul said, because theorists tend to ignore constant factors. The time it takes to perform an operation is typically multiplied by a number, some constant whose exact value may be immaterial from a theoretical standpoint. “But in practice, constants really matter,” he said. “In the real world, a factor of 10 is a game ender.”

Actual hash tables are still improving in material ways, even if they’re falling far short of the theoretical ideal. For example, a new hash table called IcebergHT, built by Bender, Kuszmaul and others, is far better than its predecessors. According to Kuszmaul, it’s twice as fast as the most space-efficient hash table available today, and it uses about a third of the space of the fastest one.

Mitzenmacher hopes that the 2023 result may soon yield another kind of benefit: “Whenever you get a new lower bound — especially one that involves some new techniques — there is always hope that you could use them … for related problems.”

There’s also the intellectual satisfaction that comes from knowing you have solved a difficult and long-standing problem, said the computer scientist Piotr Indyk of the Massachusetts Institute of Technology. “Once you’re sure that certain data structures cannot be improved upon, that can help focus the research effort.” Finally, data researchers can turn their attention away from Peterson’s challenge and focus on new problems in theoretical computer science, of which there is no shortage.
