author    Yorhel <git@yorhel.nl>  2019-05-17 12:35:47 +0200
committer Yorhel <git@yorhel.nl>  2019-05-17 12:35:47 +0200
commit    0e30dd4d999b9bbc4f0ac3f8675e843ce1b36a63 (patch)
tree      ed92e445dc89e50edc3136c9402769949166af1f
parent    132f3745e13a164fd15a1017d2e73eed59eda5bc (diff)
dat/pwlookup.md: Some fixes and improvements
-rw-r--r--  dat/doc/pwlookup.md  |  57
1 file changed, 29 insertions, 28 deletions
diff --git a/dat/doc/pwlookup.md b/dat/doc/pwlookup.md
index 86085b6..3eaf7cf 100644
--- a/dat/doc/pwlookup.md
+++ b/dat/doc/pwlookup.md
@@ -52,11 +52,10 @@ could not find a way to quickly parse and iterate through that data in Perl 5
substring match on the full leaf block is faster than trying to work with
[unpack](https://perldoc.pl/functions/unpack) and
[substr](https://perldoc.pl/functions/substr), even if a naive substring match
-can't take advantage of the string ordering to skip over processing when it has
-found a "larger" key. So I made sure that the leaf block starts with a null
-byte and encoded the passwords as a sequence of null-terminated strings. That
-way we can reliably find the key by doing a substring search on
-`"\x00${key}\x00"`.
+can't take advantage of the string ordering to skip over data when it has found
+a "larger" key. So I made sure that the leaf block starts with a null byte and
+encoded the passwords as a sequence of null-terminated strings. That way we can
+reliably find the key by doing a substring search on `"\x00${key}\x00"`.
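
An aside that is not part of the patch: a minimal Perl 5 sketch of the lookup described above. The name `leaf_contains` is made up here, and `$block` is assumed to already hold a decompressed leaf block.

```perl
sub leaf_contains {
    my ($block, $key) = @_;
    # index() returns -1 when the needle is absent. The leading
    # "\x00" in the needle works because the leaf block starts
    # with a null byte and every password is null-terminated.
    return index($block, "\x00$key\x00") >= 0;
}
```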
The index blocks are the intermediate B-tree nodes and consist of a sorted list
of block references interleaved with keys. Like this:
@@ -162,19 +161,21 @@ containing 63,941,069 passwords at 247 MiB original size.
|:----------------|--------------:|------------------:|------------:|
| Naive (plain) | 684 | 6,376 | 0.16 (6.1 s)|
| Naive (gzip) | 246 | 6,340 | 0.12 (8.3 s)|
-| B-tree plain 1k | 698 | 6,460 | 17,857.14 |
-| B-tree plain 4k | 687 | 6,436 | 9,345.79 |
-| B-tree gzip 1k | 261 | 10,772 | 9,345.79 |
-| B-tree gzip 4k | 244 | 10,572 | 5,076.14 |
-| B-tree zstd 1k | 291 | 6,856 | 12,345.68 |
-| B-tree zstd 4k | 268 | 6,724 | 6,944.44 |
-| LMDB | 1,282 | 590,792 | 333,333.33 |
+| B-tree plain 1k | 698 | 6,460 | 17,857 |
+| B-tree plain 4k | 687 | 6,436 | 9,345 |
+| B-tree gzip 1k | 261 | 10,772 | 9,345 |
+| B-tree gzip 4k | 244 | 10,572 | 5,076 |
+| B-tree zstd 1k | 291 | 6,856 | 12,345 |
+| B-tree zstd 4k | 268 | 6,724 | 6,944 |
+| LMDB | 1,282 | 590,792 | 333,333 |
Well shit. My little B-tree experiment does have an awesome size/performance
ratio when compared to the Naive approach (little surprise there), but the
performance difference with LMDB is *insane*. Although, really, that isn't too
surprising either: LMDB is written in C and has been *heavily* optimized for
-performance.
+performance. LMDB's memory usage in this benchmark should be taken with a grain
+of salt: its use of [mmap()](https://manned.org/mmap) tends to [throw off
+memory measurements](https://lmdb.readthedocs.io/en/release/#memory-usage).
I used the default compression levels of zstd and gzip. I expect that a
slightly higher compression level for zstd could reduce the database sizes to
@@ -192,13 +193,13 @@ Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
|:----------------|--------------:|------------------:|------------:|
| Naive (plain) | 14,968 | 38,536 | 0.01 (110 s)|
| Naive (gzip) | 4,245 | 38,760 | 0.01 (136 s)|
-| B-tree plain 1k | 15,377 | 6,288 | 15,384.62 |
-| B-tree plain 4k | 15,071 | 6,456 | 8,196.72 |
-| B-tree gzip 1k | 4,926 | 10,780 | 7,352.94 |
-| B-tree gzip 4k | 4,344 | 10,720 | 4,273.50 |
-| B-tree zstd 1k | 5,389 | 6,708 | 10,000.00 |
-| B-tree zstd 4k | 4,586 | 6,692 | 5,917.16 |
-| LMDB | 26,453 | 3,259,368 | 266,666.67 |
+| B-tree plain 1k | 15,377 | 6,288 | 15,384 |
+| B-tree plain 4k | 15,071 | 6,456 | 8,196 |
+| B-tree gzip 1k | 4,926 | 10,780 | 7,352 |
+| B-tree gzip 4k | 4,344 | 10,720 | 4,273 |
+| B-tree zstd 1k | 5,389 | 6,708 | 10,000 |
+| B-tree zstd 4k | 4,586 | 6,692 | 5,917 |
+| LMDB | 26,453 | 3,259,368 | 266,666 |
The main conclusion I draw from this benchmark is that the B-tree
implementation scales pretty well with increasing database sizes, as one would
@@ -216,10 +217,10 @@ Is this the best we can do? No way! Let's start with some low-hanging fruit:
necessary.
- It's possible to encode an "offset inside block" field into the block
  references and add a few more strings to the index blocks, allowing parts of
- a block can be skipped when searching for the key. This allows one to get
- some of the performance benefits of smaller block sizes without paying for
- the increase in database size. Or store multiple (smaller) intermediate
- B-tree nodes inside a single block. Same thing.
+  a block to be skipped when searching for the key. This way we can get some
+  of the performance benefits of smaller block sizes without the cost of an
+  increase in database size. Or store multiple (smaller) intermediate B-tree
+  nodes inside a single block. Same thing; a rough sketch follows this list.
- The lookup function could be rewritten in a faster language (C/C++/Rust);
  I'm pretty sure this would be a big win, too.
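
Stepping outside the patch again, here is a rough sketch of the "offset inside
block" idea from the first bullet. It assumes the block reference stores an
`$offset` that points at a `"\x00"` string boundary inside the leaf; both the
name and the layout are hypothetical.

```perl
sub leaf_contains_from {
    my ($block, $offset, $key) = @_;
    # index() accepts an optional start position, so the scan can
    # skip everything before the boundary recorded in the reference.
    return index($block, "\x00$key\x00", $offset) >= 0;
}
```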
@@ -228,7 +229,7 @@ is to use a hash function to assign strings to leaf blocks, store an array of
block pointers in the index blocks (without interleaving keys) and then use the
hash function to index the array for lookup. This makes it harder to cap the
size of leaf blocks, but with the small password strings that's not likely to
-be a problem. It does significantly complicates creating the database in the
+be a problem. It does significantly complicate creating the database in the
first place.
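
As a hypothetical sketch of that hash-based layout (not from the patch):
`Digest::MD5` stands in for whatever hash function one would actually pick,
and `leaf_for` is an invented name.

```perl
use Digest::MD5 qw(md5);

sub leaf_for {
    my ($key, $nblocks) = @_;
    # The index block is just an array of $nblocks fixed-size block
    # pointers; the first 4 bytes of the hash select a slot.
    return unpack('N', md5($key)) % $nblocks;
}
```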
Perhaps an even better approach is to not store the strings in the first place
@@ -240,7 +241,7 @@ the least amount of code and with a database size that isn't too far off from
the compressed dictionary. That goal turned out to be pretty unambitious.
-[^1]: Yeah, people still use Perl 5.
+[^1]: Yeah, people still use Perl 5 in 2019.
[^2]: But note that the passwords in the CrackStation dictionary are not sorted
according to Perl's string comparison algorithm, so it requires a separate
`LC_COLLATE=C sort` command to fix that. Also note that sorting a billion
@@ -251,5 +252,5 @@ the compressed dictionary. That goal turned out to be pretty unambitious.
free space on my SSD. I had to delete some unused build artefacts from other
projects in an emergency in order for `sort` to be able to finish sorting and
writing the *Naive (plain)* database, upon which all the others are based.
- [Ncdu](/ncdu) has saved this experiment, its author deserves a tasty pizza
- for dinner today.
+  [Ncdu](/ncdu) has saved this experiment; its author deserves an extra tasty
+  pizza for dinner today.