|
Phone book of 1 000 000 entries in 1Mb
Last post 12-30-2008 3:12 PM by m0ffx. 33 replies.
-
11-21-2008 11:08 AM
|
|
-
glatapoui


- Joined on 11-21-2008
- Posts 3
|
Phone book of 1 000 000 entries in 1Mb
Hello everybody. I'm new here but I would like to submit this question I had once in an Interview : "How would you store 1 000 000 entries of phone numbers (with 8 digits, that was the rule in France until 1996) in 1 Mega byte ?" I must admit I failed this one. Would you have succeeded ?... Regards, Glat'
|
|
-
-
PJH


- Joined on 02-14-2007
- Posts 636
|
Re: Phone book of 1 000 000 entries in 1Mb
I'd hire Earl Bradley of WEB Technology to do it for me. Or maybe Jules Gilbert.
This is not a problem that requires infinite wisdom, Benj. This is a problem that requires enough neural organization to qualify as a vertebrate, apparently a stretch for some folks these days. - Cecil Adams.
|
|
-
-
GettinSadda


- Joined on 05-25-2006
- North-East Scotland
- Posts 265
|
Re: Phone book of 1 000 000 entries in 1Mb
Interesting... it depends very much on the full details of the problem; for example "0100 0000 to 0199 9999" is a million 8-digit numbers in 22 bytes, but not very useful! More details please...
Linux is not a code base. Or a distro. Or a kernel. It's an attitude. And it's not about Open Source. It's about a bunch of people who still think vi is a good config UI.
Notice: Phorm, and its agents including ISPs collecting data on Phorm's behalf, are specifically forbidden from performing any processing or monitoring of the content of the above post. Hence, under the Regulation of Investigatory Powers Act 2000 any such attempt to profile this page by Phorm or its agents is illegal.
|
|
-
-
glatapoui


- Joined on 11-21-2008
- Posts 3
|
Re: Phone book of 1 000 000 entries in 1Mb
Well, it's been a while... If I remember correctly, it must be searchable (for example : give me the phone number #666 of 1 000 000 :) I can't recall the answer they gave me, but it wasn't about strings, I guess. Otherwise it's simple : stringify and cat all the number entries together and zip it with some efficient tool (Huffman code or whatsoever). Then, hope (or prove) that it doesn't run beyond the limit. I tried with all the possibilities I know of, but in vain. I guess one must use bit arrays, but I don't know how to go under this limit of 1Mb...
|
|
-
-
Welbog


- Joined on 02-08-2007
- Posts 449
|
Re: Phone book of 1 000 000 entries in 1Mb
glatapoui:"How would you store 1 000 000 entries of phone numbers (with 8 digits, that was the rule in France until 1996) in 1 Mega byte ?"
We're talking about this in IRC (#tdwtfmafia on irc.slashnet.org if you want to hop in).
Each 8-digit number can take up 8 bits per character if you store them as strings, but if you sort them and store the differences between the numbers instead, you'll only need 7 bits per number, on average. 7 bits per number for 1 million numbers is 7/8ths of a megabyte, but that doesn't take into account delimiting the differences (since they can have variable size).
Other than that, compression is always an options, but requires more than a megabyte of working memory to scan through. The store-the-differences option can't handle direct lookups, but it can calculate the nth phone number from the previous n-1 numbers in under a megabyte of space.
But it's a pretty crazy question in the first place.
My first instinct was to try storing the values in a trieve with a specialization of only dealing with the characters 0-9 (so 4 bits per character). Using this method, you would theoretically need to store only 1 111 111 characters, each at 4 bites, meaning it would take up slightly more than half of a megabyte. This one wouldn't allow you to search numbers by index, but given a number it would let you know whether it exists or not. If searching is a requirement then this one isn't an option.
Then again if searching is required the store-the-differences method isn't an option either since it requires sorting.
|
|
-
-
Nelle


- Joined on 11-08-2007
- graz.at.earth.milkyway.universe
- Posts 212
|
Re: Phone book of 1 000 000 entries in 1Mb
Welbog:This one wouldn't allow you to search numbers by index why not ?
|
|
-
-
Welbog


- Joined on 02-08-2007
- Posts 449
|
Re: Phone book of 1 000 000 entries in 1Mb
Nelle: Welbog:This one wouldn't allow you to search numbers by index
why not ?
I guess I'm using the wrong word for it. The correct term is "trie" instead of "trieve". That's what I get for knowing the origin of the word rather than the word itself. The implementation details answer your question trivially:
http://en.wikipedia.org/wiki/Radix_tree
http://en.wikipedia.org/wiki/Trie
|
|
-
-
PJH


- Joined on 02-14-2007
- Posts 636
|
Re: Phone book of 1 000 000 entries in 1Mb
Welbog:We're talking about this in IRC...
Welbog:specialization of only dealing with the characters 0-9 (so 4 bits per character).
And the phrase 'binary coded decimal' (BCD) didn't rear it's head?
This is not a problem that requires infinite wisdom, Benj. This is a problem that requires enough neural organization to qualify as a vertebrate, apparently a stretch for some folks these days. - Cecil Adams.
|
|
-
-
bstorer


- Joined on 02-01-2007
- Alexandria, VA
- Posts 2,333
|
Re: Phone book of 1 000 000 entries in 1Mb
PJH: Welbog:We're talking about this in IRC...
Welbog:specialization of only dealing with the characters 0-9 (so 4 bits per character).
And the phrase 'binary coded decimal' (BCD) didn't rear it's head?
I honestly don't understand the point of this comment. Are you trying to show off that you know the formal name of such a storage method? Because we all know that, too. Or are you trying to imply that BCD is somehow bad for this? Because when storage constraints are so strict, you'd be insane to waste as much space as you use.
|
|
-
-
belgariontheking


- Joined on 08-20-2007
- Cincinnati, OH, USA
- Posts 1,459
|
Re: Phone book of 1 000 000 entries in 1Mb
ASCII generally compresses down to about 1/10 its original size when zipped [citation needed] so I say take the <8MB ASCII file and zip it. You might have better luck using PACKED-DECIMAL (yes I'm using the COBOL term for it) but I don't know anything about the compression rates for that. It would start at <4MB and need to be compressed less. Or whatever PJH decided to call it.
SpectateSwamp: I can see you. You don't have to hide anymore. C'mon out and play!
|
|
-
-
PJH


- Joined on 02-14-2007
- Posts 636
|
Re: Phone book of 1 000 000 entries in 1Mb
bstorer:I honestly don't understand the point of this comment.
OP mentioned they had a talk in IRC(!), they mentioned something similar to BCD, but didn't mention BCD. bstorer:Because we all know that, too.
Given this particular cicumstance, I'd suggest 'we' don't.
Feel free to correct me if you were on this IRC session.
This is not a problem that requires infinite wisdom, Benj. This is a problem that requires enough neural organization to qualify as a vertebrate, apparently a stretch for some folks these days. - Cecil Adams.
|
|
-
-
PJH


- Joined on 02-14-2007
- Posts 636
|
Re: Phone book of 1 000 000 entries in 1Mb
belgariontheking:Or whatever PJH decided to call it.
BCD. I didn't decide the name. Before my time.
2 digits per byte. Insufficient given the criteria.
This is not a problem that requires infinite wisdom, Benj. This is a problem that requires enough neural organization to qualify as a vertebrate, apparently a stretch for some folks these days. - Cecil Adams.
|
|
-
-
belgariontheking


- Joined on 08-20-2007
- Cincinnati, OH, USA
- Posts 1,459
|
Re: Phone book of 1 000 000 entries in 1Mb
PJH:OP mentioned they had a talk in IRC(!)
Is there some new definition of OP that doesn't mean "The guy (or post) that started this thread?"
SpectateSwamp: I can see you. You don't have to hide anymore. C'mon out and play!
|
|
-
-
bstorer


- Joined on 02-01-2007
- Alexandria, VA
- Posts 2,333
|
Re: Phone book of 1 000 000 entries in 1Mb
PJH: bstorer:I honestly don't understand the point of this comment.
OP mentioned they had a talk in IRC(!), they mentioned something similar to BCD, but didn't mention BCD. bstorer:Because we all know that, too.
Given this particular cicumstance, I'd suggest 'we' don't.
Feel free to correct me if you were on this IRC session.
I still don't get what your point is other than flexing your big, sexy brain. Who cares what it's called? It doesn't matter if it's got a name or not, the math still comes up the same. For the record, though, I was in the channel at the time, but not participating in the conversation. The term "BCD" did not come up. I know the term, but even had I been in the conversation, I wouldn't have bothered with it. Welbog has since informed that he wasn't familiar with the term, which surprised me, because it's not an especially esoteric term. But his point was the same: "I've probably heard the term. Like I said, terms aren't important to me. The fact that you need 4 bits to store 10 values is what's important, regardless of what it's called."
|
|
-
-
Autonuke


- Joined on 11-07-2006
- Posts 49
|
Re: Phone book of 1 000 000 entries in 1Mb
The only way I can think of is like this: there are 10^8 C 10^6 = 10^8! / (10^6! * (10^8 - 10^6)!) possible ways of choosing 1 million distinct 8-digit numbers in, say, ascending order. That's a number that requires slightly more than 8 million bits (Python and me make it 8079302.3053085813), less than a megabyte (okay, 'mebibyte'). So there's a unique 1-megabyte number for every possible list, and a unique list for (nearly) every 1-megabyte number. Of course, you'd have to do a bit of processing to actually search for the n-th phone number in a list encoded this way.
Strange interview question though. How much time did they allow you?
|
|
-
-
Nelle


- Joined on 11-08-2007
- graz.at.earth.milkyway.universe
- Posts 212
|
Re: Phone book of 1 000 000 entries in 1Mb
glatapoui:I can't recall the answer they gave me, but it wasn't about strings, I guess.
if you need to save 1000000 phone numbers in 1 Mb RAM, that gives you approx. 1 byte per 8 digit number... however to save one 8 digit number you need at least 3 bytes + change ... it all boils down to the entropy of the list, but in the worst case scenario (IMHO) it is impossible ...
you could save the differences as welbog sugested, but I would modify his approach and require a presorted list in order to save the differences of the entire telephone numbers...
In 99 milion possibles and 1 milion numbers, with the presorted list the largest difference is 99 which easily fits in one byte and there are couple of bits more which can be used to store keyframes. the downslide is that you always have to start from the begining (or the last keyframe) in order to return the exact number.
|
|
-
-
curtmack


- Joined on 10-25-2007
- Posts 43
|
Re: Phone book of 1 000 000 entries in 1Mb
Going back to the whole "sort the list and store only differences" idea, let's think of this a different way: an eight digit decimal number can store a result from 0 to 99,999,999. We're guaranteed there are 1 million phone numbers, which means that the greatest difference that could occur between two numbers is 98,999,999. That can be stored in 27 bits, but this is impossible - processors aren't designed to work with odd numbers of bits. What's needed is some way to use arbitrary-sized numbers and make sure the program knows where the dividing lines are. 32 bits will store any number in our range, so it's the maximum. Then we could have three bytes, two bytes, or one byte as the other possible dividing lines. Let's suppose we precede the file with a series of 1 million two-bit flags indicating the length of the corresponding term. This adds up to 250 kilobytes, which could be squeezed into the memory of a modest computer. Then it just runs through the list, more or less. The problem is that this approach could never actually get below a megabyte. But it's a start. Any other ideas?
|
|
-
-
GettinSadda


- Joined on 05-25-2006
- North-East Scotland
- Posts 265
|
Re: Phone book of 1 000 000 entries in 1Mb
Well, try this: * Sort the numbers * Convert to a list of differences from the previous number * Huffman code this list The chances are that most of the differences will be 1, which should code as a single bit. The entropy of the whole list should easily be less than 8 bits. Searching would be slow as you would need to scan from the start, but on any modern CPU that would still be very quick.
Linux is not a code base. Or a distro. Or a kernel. It's an attitude. And it's not about Open Source. It's about a bunch of people who still think vi is a good config UI.
Notice: Phorm, and its agents including ISPs collecting data on Phorm's behalf, are specifically forbidden from performing any processing or monitoring of the content of the above post. Hence, under the Regulation of Investigatory Powers Act 2000 any such attempt to profile this page by Phorm or its agents is illegal.
|
|
-
-
Nelle


- Joined on 11-08-2007
- graz.at.earth.milkyway.universe
- Posts 212
|
Re: Phone book of 1 000 000 entries in 1Mb
glatapoui:I guess one must use bit arrays, but I don't know how to go under this limit of 1Mb... glatapoui: If I remember correctly, it must be searchable (for example : give me the phone number #666 of 1 000 000 :)
First of all thanks for posting this problem, this one was a really nice headbreaker ... Presort the list and store it in the following format :
struct telephonebook_block { keyframe : 27; diff1 : 7; ... diff27 : 7; }
one such block can hold 28 telephone numbers and requires 27 bytes to store them ... to store 1.000.020 numbers, you need 35 715 blocks and take up 964305 bytes ... the lookup is pretty fast since you only do 27 diff. additions in the worst case scenario ...
the tricky part is getting the number since you can not create an array of bitfields (at least in C) so the solution would go something like this :
struct strkeyframe { unsigned keyframe : 27; };
struct strdiff { unsigned diff : 7; };
uint32 get_number_at ( uint32 index, uchar8 * telephonebook ) { uint32 return_value = 0; uint32 keyframe_index = index / 27; uint32 number_index = index % 27; uchar8 * block_start = (telephonebook + keyframe_index * 27); // ---> uchar8 * diff_start = block_start + 2; // +++>
// get keyframe strkeyframe * pkeyframe = (strkeyframe *) blockstart; return_value = (uint32) pkeyframe->keyframe;
// got lucky if (!number_index) return return_value;
// get the diffs for (uint32 diff_idx = 0; diff_idx < number_index; diff_idx++) { uint32 byte_step = (7 * diff_idx + 3) / 8; uint32 numblock = *( (uint32 *) (diff_start + byte_step) ); numblock = numblock << ((7 * diff_idx - 4) % 8); return_value += (uint32) ((strdiff *) & numblock)->diff; }
return return_value; }
|
|
-
-
Welbog


- Joined on 02-08-2007
- Posts 449
|
Re: Phone book of 1 000 000 entries in 1Mb
Nelle:one such block can hold 28 telephone numbers and requires 27 bytes to store them
You only have 7 bits to store the difference. What if the first number is 00 000 000 and the second number is 00 000 129?
If you're using a different keyframe for 00 000 129, then you're using 27 bytes to store 00 000 000. (27 bits for the keyframe 00 000 000 and all 27 7-bit fields have to be 0 (indicating this frame is over). All you need is 37038 (37038 numbers taking 27 bytes of space each > 1 megabyte) consecutive phone numbers differing by 129 and you're over 1 megabyte. 37038 * 129 = 4 777 902, which is well within the domain of 8-digit phone numbers.
You made the assumption somewhere that no two consecutive can differ by more than 99, when the difference can be as much as 99 000 001 (first number is 00 000 000, second is 99 000 001, remaining numbers count up until 99 999 999). The average difference will be 100, but that's not a maximum. That's why I had brought up the fact that to store the differences you'd have to use a delimited storage mechanism to account for the variability in differences.
|
|
-
-
Goplat


- Joined on 03-22-2006
- Posts 56
|
Re: Phone book of 1 000 000 entries in 1Mb
Here's a way that's guaranteed to work, if you don't mind violating a few patents. Make a bitmap of all 100000000 possible phone numbers. 99% are not present, 1% are present. Now use arithmetic coding to compress the bitmap, which will store each "not present" in -log2(0.99) = .0144995696951151 bits, and each "present" in -log2(0.01) = 6.64385618977472 bits. In total, this will take 99000000 * -log2(0.99) + 1000000 * -log2(0.01) = 8079314 bits, which is 1009915 bytes or about 0.96 megabytes.
|
|
-
-
Nelle


- Joined on 11-08-2007
- graz.at.earth.milkyway.universe
- Posts 212
|
Re: Phone book of 1 000 000 entries in 1Mb
Welbog:You made the assumption somewhere that no two consecutive can differ by more than 99, when the difference can be as much as 99 000 001 (first number is 00 000 000, second is 99 000 001, remaining numbers count up until 99 999 999). you're right ... i wanted to avoid scaning from the begining, but it seems impossible ... Welbog:That's why I had brought up the fact that to store the differences you'd have to use a delimited storage mechanism to account for the variability in differences.
Well GettinSadda and Goplat suggested Huffman coding and aritmetic coding, the question now is only how large would the decoding table be or would the encoded data + table fit in the 1 Mb ...
|
|
-
-
Welbog


- Joined on 02-08-2007
- Posts 449
|
Re: Phone book of 1 000 000 entries in 1Mb
Nelle:Well GettinSadda and Goplat suggested Huffman coding and aritmetic coding, the question now is only how large would the decoding table be or would the encoded data + table fit in the 1 Mb ...
I can't speak for Huffman, but Goplat's idea looks completely feasible.
Goplat:In total, this will take 99000000 * -log2(0.99) + 1000000 * -log2(0.01) = 8079314 bits, which is 1009915 bytes or about 0.96 megabytes.
We have .04 megabytes to store the fact that [0-.99) -> 0 and [.99,1) -> 1. That seems like a trivially easy thing to store.
Unless we're counting the size of the algorithm required to decode the numbers, I think this is a safe bet.
The Huffman idea will probably cut down to under a megabyte in the average cases, but if there's a wide variety of distinct differences between the phone numbers, the dictionary is going to get large and the encodings themselves will get too large.
I cast my vote for arithmetic encoding, except it doesn't preserve initial order because it requires what amounts to sorting. That wasn't part of the specification of the problem so hopefully it's OK.
|
|
-
-
GettinSadda


- Joined on 05-25-2006
- North-East Scotland
- Posts 265
|
Re: Phone book of 1 000 000 entries in 1Mb
Yup, it seems that arithmetic coding will, just, manage it: Zero ratio = 99000000:1000000
Random data generated in 9.48 seconds Encoder test completed in 4.74 seconds Decoder test completed in 5.20 seconds
Used 1009922 bytes for coding 100000000 symbols
Linux is not a code base. Or a distro. Or a kernel. It's an attitude. And it's not about Open Source. It's about a bunch of people who still think vi is a good config UI.
Notice: Phorm, and its agents including ISPs collecting data on Phorm's behalf, are specifically forbidden from performing any processing or monitoring of the content of the above post. Hence, under the Regulation of Investigatory Powers Act 2000 any such attempt to profile this page by Phorm or its agents is illegal.
|
|
-
|
|