
Is there an explanation for why, to search through an unstructured database, the average number of checks is $\frac{N}{2}$ in classical computation?

Quantum Computing Asked on March 8, 2021

I have come across many books online that all state that if $N$ is large enough, then the average number of checks is $N/2$, but is there a mathematical explanation or derivation for why this is true?

I have considered the case $N=2^n$ and tried to use logarithms, but I am not getting anywhere with this. Any help would be greatly appreciated.

Sorry, I should have made it clear… I was referring to Grover's search algorithm and how I can show that if $N$ is large enough, then the average number of checks is $N/2$. I know I will have to check $N$ objects in the worst-case scenario.

3 Answers

It depends how formal you want to be about this. You might just start by saying that if you select $N/2$ items out of a database of size $N$, you've got a probability of 1/2 of finding the element you want. That gives a pretty good hint.
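To back up that intuition with a quick simulation (a hypothetical sketch; the values of `N` and `trials` are arbitrary), we can estimate the probability that a random half of the database contains the target:

```python
import random

N = 1000          # database size (arbitrary for this sketch)
trials = 20000    # number of repeated experiments
hits = 0
for _ in range(trials):
    target = random.randrange(N)
    # draw N/2 distinct items from the database at random
    sample = set(random.sample(range(N), N // 2))
    if target in sample:
        hits += 1

p = hits / trials  # should be close to 0.5
```

With enough trials, `p` settles near $1/2$, matching the hint above.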

But if you want to do the full gory detail: imagine you take samples and keep going until you get a match. What's the probability that you stop on the $n^{th}$ round? It's given by (probability of not matching on 1st round)$\times$(probability of not matching on 2nd round)$\times\ldots\times$(probability of not matching on $(n-1)^{th}$ round)$\times$(probability of matching on $n^{th}$ round).

Assuming we got to the $n^{th}$ round, we have $N+1-n$ samples to pick from, of which 1 is a match and the other $N-n$ do not match. Thus, the probability of ending on the $n^{th}$ round is $$ p(n)=\frac{N-1}{N}\times\frac{N-2}{N-1}\times\frac{N-3}{N-2}\times\ldots\times\frac{N+1-n}{N+2-n}\times\frac{1}{N+1-n}. $$ If you look through all these terms, most cancel and you just get $$ p(n)=\frac{1}{N}. $$ You're equally likely to find the answer on each selection (perhaps that feels obvious now we've calculated it!).

The average number of runs that you need is then $$ \sum_n n\,p(n)=\sum_{n=1}^{N}\frac{n}{N}=\frac{N+1}{2}. $$
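As a sanity check on this derivation (a minimal sketch; the value of `N` is arbitrary), we can evaluate $p(n)$ from the product formula with exact rational arithmetic and confirm both that it telescopes to $1/N$ and that the mean is $(N+1)/2$:

```python
from fractions import Fraction

N = 50  # arbitrary database size for this check

def p(n):
    # miss on rounds 1..n-1, then hit on round n
    prob = Fraction(1)
    for k in range(1, n):
        # on round k there are N-k+1 items left, one of which matches
        prob *= Fraction(N - k, N - k + 1)
    return prob * Fraction(1, N - n + 1)

# every p(n) telescopes to exactly 1/N
assert all(p(n) == Fraction(1, N) for n in range(1, N + 1))

# and the expected number of rounds is (N+1)/2
mean = sum(n * p(n) for n in range(1, N + 1))
assert mean == Fraction(N + 1, 2)
```

Using `Fraction` keeps the arithmetic exact, so the telescoping identity holds precisely rather than up to floating-point error.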

Answered by DaftWullie on March 8, 2021

I think what you are talking about must be an unstructured database, whose search can be sped up by Grover's search algorithm to $O(\sqrt{n})$.

In courses like computation theory there is often a related discussion of searching structured versus unstructured databases, where you learn that classically, searching an unstructured database takes $O(n)$, while a structured one can be faster.

For the unstructured case, which should be your case, say there are $n$ items in the database. The probability that an arbitrary index leads to the right answer is $\frac{1}{n}$. If you tested one entry and found that it is wrong, then you guess again; this time the probability of failing is $1-\frac{1}{n-1}$. If you keep failing, then after step $x$ the corresponding probability is $\displaystyle\prod_{i=n-x+1}^{n}\frac{i-1}{i}$.

Given this formula, at step $x=\frac{n}{2}$ the failure probability is $p=\displaystyle\prod_{i=n/2+1}^{n}\frac{i-1}{i}=\frac{n/2}{n}=\frac{1}{2}$, since the product telescopes (and at step $n$ you can see that the success probability is 1).
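The telescoping can be checked numerically (a small sketch; the value of `n` is arbitrary): the product collapses to $(n-x)/n$, so the failure probability after $x=n/2$ misses is exactly $\frac{1}{2}$.

```python
from fractions import Fraction

n = 1000  # arbitrary database size for this check

def fail(x):
    # probability of missing on all of the first x guesses
    prob = Fraction(1)
    for i in range(n - x + 1, n + 1):
        prob *= Fraction(i - 1, i)
    return prob

assert fail(n // 2) == Fraction(1, 2)   # exactly 1/2 at step n/2
assert fail(n) == 0                     # guaranteed to succeed by step n
```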

Answered by Yitian Wang on March 8, 2021

Experiment is our friend!

I wrote a little script to test this theory:

#!/usr/bin/python3

import time
import random

li      = [ ]
size        = 10000     # 3 seconds
# size      = 1000000
# size      = 100000
# size      = 1000      # 0.2 seconds
# size      = 100


DEBUG       = False
# DEBUG     = True

start_time  = time.time( )
for i in range( 0, size ):
    li.append( i )
random.shuffle( li )
speed       = time.time( ) - start_time

print( "gen done in:", speed, "seconds" )

if DEBUG:
    print( li )

def find_i( i ):
    # linear scan; returns the zero-based index of i,
    # i.e. the number of comparisons minus 1
    N = 0
    for j in li:
        if i == j:
            return N
        N += 1

start_time  = time.time( )

accum_N     = 0
for i in range( 0, size ):
    accum_N += find_i( i )

speed       = time.time( ) - start_time
print( "find done in:", speed, "seconds" )

#--------------------------------------------------------
average     = accum_N / size

relation    = size / average

print( "accum_N:", accum_N )
print( "size:", size )
print( "average:", average )
print( "size relate to average number of steps as", int( relation ), "to 1" )

Output:

$ python3 script.py
gen done in: 0.004632234573364258 seconds
find done in: 3.0313122272491455 seconds
accum_N: 49995000
size: 10000
average: 4999.5
size relate to average number of steps as 2 to 1
  • On the gen stage, we simply push every int from range 0 to size === $N$.
  • Thereafter, we shuffle the resulting list.
  • Thereafter, we find each integer from range 0 to size in this list, and accumulate the zero-based index at which it was found in the accum_N variable.
  • As you can see, the average index is $(N-1)/2 = 4999.5$; since each find takes index$+1$ comparisons, the average number of checks is $(N+1)/2$, i.e. approximately $N/2$ for large $N$.

Answered by AinstineSaidEnough on March 8, 2021

