Coder Social home page Coder Social logo

Comments (13)

jugge83 avatar jugge83 commented on June 5, 2024 1

@safijari you should make the video. I can Imagine that it will be popular and the more Python resources that are out there the better.
Sadly the cache=True kwarg or an initial "pre-run" of the function didn't improve execution times for me.

from primes.

rhbvkleef avatar rhbvkleef commented on June 5, 2024

I banged my head against this for a while, as I saw some inefficiencies as compared to Dave´s implementation and other solutions (including my Numba implementation). Yours was faster to start with, but that seemed to be mainly because you dropped the class.

I optimized it a bit, and got it faster than the C++ solution on my machine:

Implementation kops/s
Python with Numba 1.4
Python without numba 1.1
C++ 1.1
from numba import njit, b1, int32
from math import sqrt
import numpy as np

@njit(b1[:](int32))
def run_sieve(limit):
    prime_list = np.full(limit // 2, True)
    prime_list[0] = False

    for factor in range(3, int(sqrt(limit)) + 1, 2):
        if prime_list[factor // 2]:
            prime_list[(factor * 3) // 2:limit // 2:factor] = False

    return prime_list[:(limit // 2) + 1]

The only thing is that 2 needs to be added to the list of primes, as even numbers are not considered:

import itertools

def primes(sieve):
    return itertools.chain([2], (prime * 2 + 1 for (prime, is_prime) in enumerate(sieve) if is_prime))

Additionally, I am seeing a shocking amount of cache misses and branch mispredicts. I don´t think we can do much about the mispredicts (and I don´t really know where they come from: they are likely an interpreter artifact), but a segmented sieve could be employed to improve locality on the data-cache.

from primes.

jugge83 avatar jugge83 commented on June 5, 2024

Nice trick to only look at the odd integers!

I think the major part of my speed improvement came from the removal of the double for loops in the while loop of the sieve method, But any function or method calls add overhead.

How do you check the cache misses and branch mispredicts?

I managed to squeeze out a couple of thousand iterations more by starting at factor ** 2 instead of factor * 3 and possibly the fastmath flag, I don't think it matters if you use the math.sqrt() or ** (1/2) function/operator.

Now I get approx. 17 000 passes compared to the initial 40 (425 times faster). Not bad for a slow language ;)

import timeit
import itertools
from numba import njit, b1, int32
import numpy as np


@njit(b1[:](int32), fastmath = True)
def run_sieve(limit):
    prime_list = np.full(limit // 2, True)  # defines a list for odd integers
    prime_list[0] = False                   # 1 is not a prime
    for factor in range(3, int(limit ** (1/2)) + 1, 2):
        if prime_list[factor // 2]:         # just looking at odd integers everything halves
            prime_list[(factor ** 2) // 2:limit // 2:factor] = False
    return prime_list[:(limit // 2) + 1]

def print_results(time_delta, passes, primes, limit):
    prime_counts = {
        10: 4,  # Historical data for validating our results - the number of primes
        100: 25,  # to be found under some limit, such as 168 primes under 1000
        1000: 168,
        10000: 1229,
        100000: 9592,
        1000000: 78498,
        10000000: 664579,
        100000000: 5761455,
    }
    print(f"""Passes: {passes}\nTime: {time_delta}\nAvg: {time_delta/passes}\
        \nLimit: {limit}\nCount: {len(primes)}\nValid: {prime_counts[limit]==len(primes)}""")

def primes(sieve):
    return itertools.chain([2], (prime * 2 + 1 for (prime, is_prime) in enumerate(sieve) if is_prime))

if __name__=="__main__":
    t_start = timeit.default_timer()  # Record our starting time
    passes = 0
    limit = 1_000_000
    while timeit.default_timer() - t_start < 10:
        prime_validators = run_sieve(limit)
        passes = passes + 1
    time_delta = timeit.default_timer() - t_start
    primes = list(primes(prime_validators))
    print(primes[:100])
    print_results(time_delta, passes, primes, limit)

from primes.

rhbvkleef avatar rhbvkleef commented on June 5, 2024

I think the major part of my speed improvement came from the removal of the double for loops in the while loop of the sieve method

Array slicing on numpy arrays is very fast, but it´s only a performance gain when not using Numba. Numba was able to optimize a loop to set each value separately.

How do you check the cache misses and branch mispredicts?

I used perf.

I managed to squeeze out a couple of thousand iterations more by starting at factor ** 2 instead of factor * 3

Cool! :)

from primes.

safijari avatar safijari commented on June 5, 2024

Oh wow. I was thinking about doing this and making a video for my channel as a sort of playful response but it's already been done, and really well. Kudos!

from primes.

iJohnnis avatar iJohnnis commented on June 5, 2024

@jugge83 & @rhbvkleef & @safijari:

I'm not a python expert but I believe that this is an unfair comparison given that your code does not initialize a new class each time and, probably, reuses the same array. Also, I think you're using sequences instead of arrays, which means that most of the code does not actually execute until you use the list operation.

Try executing your code once with a 100 million sieve size (1e8).

My results were:
• 347 ms for c++ (Dave's code)
• 413 ms for c# (an optimized version of Dave's code)
• 493 ms for python (your code)

Edit 03/04/2021: I had to fix the number because it will confuse ppl that read this and not the posts below this. My old results were 4093ms for python but I accidentally had the code running for 1000000000 (1e9) instead of the 1e8 that I had on the other languages. That was 100% my mistake and it is not representative of the language or the code. I have to add here that running once and comparing it doesn't mean much, all I wanted to illustrate with this example is the class initialization cost that the other two languages compare to rhbvkleef's.
If anyone is interested in the c# code that I run you can find it here: https://github.com/iJohnnis/PrimeSieve/tree/master/PrimeSieveCS.

Generally going faster than c++ means that something is wrong, given that c/c++ are the lowest level languages (apart from assembly) and that python uses c ... The only way that I see python doing anything faster than c is wrong/inefficient c code.

from primes.

jugge83 avatar jugge83 commented on June 5, 2024

I'm getting the following:

  • 143 ms for Daves C++ code.
  • 298 ms for the above Python code.

The code does not need to be object oriented and my goal was to optimize the python code as much as I could.
For me, Python development is way faster than C++ development ;)

from primes.

safijari avatar safijari commented on June 5, 2024

Something that I haven't seen considered here is taking the compilation time into account, since the first invocation of run_seive would take a fair bit longer because of Numba compiling it down into LLVM bytecode. Can be remedied either by running the function once before the timing starts or by using the cache=True kwarg so on subsequent runs the compilation doesn't need to happen.

There's also AOT compilation for Numba. Speaking of fair, that would the most fair way to do the comparison.

from primes.

safijari avatar safijari commented on June 5, 2024

from primes.

iJohnnis avatar iJohnnis commented on June 5, 2024

I'm getting the following:

  • 143 ms for Daves C++ code.
  • 298 ms for the above Python code.

You're right, I accidentally posted the results for the 1 billion sieve-size when I run the python code.

The code does not need to be object oriented and my goal was to optimize the python code as much as I could.

Yeah, but if you have to compare the two languages, you have to do similar things.
c++ doesn't need to initialize a new object each time either.

For me, Python development is way faster than C++ development ;)

Yeah, no doubt. Python is, one of the fastest, if not the fastest, languages if you're comparing development time.

from primes.

rhbvkleef avatar rhbvkleef commented on June 5, 2024

@iJohnnis

I'm not a python expert but I believe that this is an unfair comparison given that your code does not initialize a new class each time [...

This is true, but I´ve also written an alternative that does initialize a class each time, and the performance is very similar. It is slightly slower, taking off a couple of percent, so it is not significant in any way.

...] and, probably, reuses the same array.

It might, it might not, but that is a matter of page reuse, not a matter of data reuse. It is likely that the same happens in C++ and C#, and is thus, entirely fair.

Also, I think you're using sequences instead of arrays, which means that most of the code does not actually execute until you use the list operation.

This is demonstrably false. I extracted the LLVM IR, and that clearly shows that the code produces the actual sieve representation.

Try executing your code once with a 100 million sieve size (1e8).
My results were:

  • 347 ms for c++ (Dave's code)
  • 413 ms for c# (an optimized version of Dave's code)
  • 4093 ms for python (your code)

My measurements were these:

Language Iterations (limit=1 million) Iterations (limit=100 million)
C++ (GCC) 11556 79
C++ (Clang) 11267 83
C# (Mono) 1521 Not measured
C# (DotNet Core) 3317 Not measured
Python (with classes) 10345 32
Python (this implementation) 14845 35

So I´m not entirely sure what to make of it. The python implementation implementation seems to at least be comparable to slightly better for a lower limit, whilst it is significantly slower for higher limit. I am certain that the reason that Python is comparable or faster for lower limits, is simply because it is actually faster, and that the C++ implementation isn´t as good as can be.

Generally going faster than c++ means that something is wrong, given that c/c++ are the lowest level languages

Indeed, something might be wrong. We should better look at whether the C++ implementation is as efficient as can be. We must consider this too, and I would like to point the finger squarely at this.

(Technically C, and especially C++, are not the lowest level languages barring assembly, but I simply wish to point that out, and actually invalidating your point with that would be entirely unfair, and incorrect)

The only way that I see python doing anything faster that c is wrong/inefficient c code.

Indeed, and that seems to be the case. (there are more hypothetical options too, but I don´t think that is the case)


@safijari, @jugge83

Something that I haven't seen considered here is taking the compilation time into account, since the first invocation of run_seive would take a fair bit longer because of Numba compiling it down into LLVM bytecode.

sadly the cache=True kwarg or an initial "pre-run" of the function didn't improve execution times for me.

The reason that we don´t see that, is because by simply adding a type signature in the decorator, a compilation for that type signature is triggered.

from primes.

iJohnnis avatar iJohnnis commented on June 5, 2024

@rhbvkleef if you're interested please read my edited comment. The c# code that I wrote is closer to the c++, the main difference is that I can't initialize the boolean array on true on c# so I used the opposite boolean, a "NoPrimes" array. The code that is written here is closer to the initial version of Dave's code which was a lot more inefficient. Note that I have practically zero experience on Github.

from primes.

rgtoomey avatar rgtoomey commented on June 5, 2024

Dear Dave. I have enjoyed watching your videos. I know some about programming (and am trying to learn more as a hobby). I sat down this weekend, struggled with Anaconda, learned about Github (as linked to your video), and I re-wrote your code (I think) in a way that makes sense to me. I had to look up seive, by the way. Please keep up the videos.

On my 13" Macbook Pro, this code computes the correct number of primes up to 1e9 in about 5 seconds. I tried 1e10, but my laptop decided it was too much. In some of the above posts, users mentioned a Numba package, but this did not really change the speed of my program.

import numpy as np
import math as m
from time import time


def primes(maxNumber):                                      #Function that returns all the prime numbers up to a maximum number   

    start=time()
    
    booleanTestArray = primeSearch(maxNumber)               #Returns a Boolean test of all odd numbers between 2 and the maximum number. A prime number returns true.
    primesArray=getPrimeNumbers(booleanTestArray)           #Converts the Boolean test to an array of prime numbers between 2 and the maximum number
    
    finish=time()  
    
    print(finish-start)     
    return primesArray


def primeSearch(maxNumber) :
    
    possibleTotalPrimes = (maxNumber + 1) // 2              #Sets the total number of potential prime number candidates
    booleanTestArray = np.full(possibleTotalPrimes,True)    #Establishes Boolean test array corresponding to [2,3,5,7, ..., possibleTotalPrimes]
    maxFactor = m.floor(m.sqrt(maxNumber))                  #Largest number potentially factorable into maxNumber
    
    for factor in range(3,maxFactor+1,2) :                  #Test all divisible factors starting from 3 up the maximum
        if booleanTestArray[factor//2] :                    #Check if divisible factor has not been set to false (meaning all multiples of it has already been eliminated as potential primes) 
            booleanTestArray[(factor**2)//2:(maxNumber+1):factor] = False   #If the divisible factor is a prime, set all multiples of the prime number to false in the booleanTestArray
    return booleanTestArray


def getPrimeNumbers(booleanTestArray) :                     #Converts booleanTestArray to array of prime numbers 
    
    primesArray = np.where(booleanTestArray)[0]
    primesArray= 2*primesArray+1 
    primesArray[0]=2
    return primesArray

from primes.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.