Hello Dave. I saw your YouTube video and I instantly became a fan ;)

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

I'm getting the following: 143 ms for Daves C++ code. <li

Faster python implementation. ,about plummerssoftwarellc/primes

Comments (13)

jugge83 commented on August 26, 2024 1

@safijari you should make the video. I can Imagine that it will be popular and the more Python resources that are out there the better.
Sadly the cache=True kwarg or an initial "pre-run" of the function didn't improve execution times for me.

from primes.

rhbvkleef commented on August 26, 2024

I banged my head against this for a while, as I saw some inefficiencies as compared to Dave´s implementation and other solutions (including my Numba implementation). Yours was faster to start with, but that seemed to be mainly because you dropped the class.

I optimized it a bit, and got it faster than the C++ solution on my machine:

Implementation	kops/s
Python with Numba	1.4
Python without numba	1.1
C++	1.1

from numba import njit, b1, int32
from math import sqrt
import numpy as np

@njit(b1[:](int32))
def run_sieve(limit):
    prime_list = np.full(limit // 2, True)
    prime_list[0] = False

    for factor in range(3, int(sqrt(limit)) + 1, 2):
        if prime_list[factor // 2]:
            prime_list[(factor * 3) // 2:limit // 2:factor] = False

    return prime_list[:(limit // 2) + 1]

The only thing is that 2 needs to be added to the list of primes, as even numbers are not considered:

import itertools

def primes(sieve):
    return itertools.chain([2], (prime * 2 + 1 for (prime, is_prime) in enumerate(sieve) if is_prime))

Additionally, I am seeing a shocking amount of cache misses and branch mispredicts. I don´t think we can do much about the mispredicts (and I don´t really know where they come from: they are likely an interpreter artifact), but a segmented sieve could be employed to improve locality on the data-cache.

from primes.

jugge83 commented on August 26, 2024

Nice trick to only look at the odd integers!

I think the major part of my speed improvement came from the removal of the double for loops in the while loop of the sieve method, But any function or method calls add overhead.

How do you check the cache misses and branch mispredicts?

I managed to squeeze out a couple of thousand iterations more by starting at factor ** 2 instead of factor * 3 and possibly the fastmath flag, I don't think it matters if you use the math.sqrt() or ** (1/2) function/operator.

Now I get approx. 17 000 passes compared to the initial 40 (425 times faster). Not bad for a slow language ;)

import timeit
import itertools
from numba import njit, b1, int32
import numpy as np


@njit(b1[:](int32), fastmath = True)
def run_sieve(limit):
    prime_list = np.full(limit // 2, True)  # defines a list for odd integers
    prime_list[0] = False                   # 1 is not a prime
    for factor in range(3, int(limit ** (1/2)) + 1, 2):
        if prime_list[factor // 2]:         # just looking at odd integers everything halves
            prime_list[(factor ** 2) // 2:limit // 2:factor] = False
    return prime_list[:(limit // 2) + 1]

def print_results(time_delta, passes, primes, limit):
    prime_counts = {
        10: 4,  # Historical data for validating our results - the number of primes
        100: 25,  # to be found under some limit, such as 168 primes under 1000
        1000: 168,
        10000: 1229,
        100000: 9592,
        1000000: 78498,
        10000000: 664579,
        100000000: 5761455,
    }
    print(f"""Passes: {passes}\nTime: {time_delta}\nAvg: {time_delta/passes}\
        \nLimit: {limit}\nCount: {len(primes)}\nValid: {prime_counts[limit]==len(primes)}""")

def primes(sieve):
    return itertools.chain([2], (prime * 2 + 1 for (prime, is_prime) in enumerate(sieve) if is_prime))

if __name__=="__main__":
    t_start = timeit.default_timer()  # Record our starting time
    passes = 0
    limit = 1_000_000
    while timeit.default_timer() - t_start < 10:
        prime_validators = run_sieve(limit)
        passes = passes + 1
    time_delta = timeit.default_timer() - t_start
    primes = list(primes(prime_validators))
    print(primes[:100])
    print_results(time_delta, passes, primes, limit)

from primes.

rhbvkleef commented on August 26, 2024

I think the major part of my speed improvement came from the removal of the double for loops in the while loop of the sieve method

Array slicing on numpy arrays is very fast, but it´s only a performance gain when not using Numba. Numba was able to optimize a loop to set each value separately.

How do you check the cache misses and branch mispredicts?

I used perf.

I managed to squeeze out a couple of thousand iterations more by starting at factor ** 2 instead of factor * 3

Cool! :)

from primes.

safijari commented on August 26, 2024

Oh wow. I was thinking about doing this and making a video for my channel as a sort of playful response but it's already been done, and really well. Kudos!

from primes.

iJohnnis commented on August 26, 2024

@jugge83 & @rhbvkleef & @safijari:

I'm not a python expert but I believe that this is an unfair comparison given that your code does not initialize a new class each time and, probably, reuses the same array. Also, I think you're using sequences instead of arrays, which means that most of the code does not actually execute until you use the list operation.

Try executing your code once with a 100 million sieve size (1e8).

My results were:
• 347 ms for c++ (Dave's code)
• 413 ms for c# (an optimized version of Dave's code)
• 493 ms for python (your code)

Edit 03/04/2021: I had to fix the number because it will confuse ppl that read this and not the posts below this. My old results were 4093ms for python but I accidentally had the code running for 1000000000 (1e9) instead of the 1e8 that I had on the other languages. That was 100% my mistake and it is not representative of the language or the code. I have to add here that running once and comparing it doesn't mean much, all I wanted to illustrate with this example is the class initialization cost that the other two languages compare to rhbvkleef's.
If anyone is interested in the c# code that I run you can find it here: https://github.com/iJohnnis/PrimeSieve/tree/master/PrimeSieveCS.

Generally going faster than c++ means that something is wrong, given that c/c++ are the lowest level languages (apart from assembly) and that python uses c ... The only way that I see python doing anything faster than c is wrong/inefficient c code.

from primes.

jugge83 commented on August 26, 2024

I'm getting the following:

143 ms for Daves C++ code.
298 ms for the above Python code.

The code does not need to be object oriented and my goal was to optimize the python code as much as I could.
For me, Python development is way faster than C++ development ;)

from primes.

safijari commented on August 26, 2024

Something that I haven't seen considered here is taking the compilation time into account, since the first invocation of run_seive would take a fair bit longer because of Numba compiling it down into LLVM bytecode. Can be remedied either by running the function once before the timing starts or by using the cache=True kwarg so on subsequent runs the compilation doesn't need to happen.

There's also AOT compilation for Numba. Speaking of fair, that would the most fair way to do the comparison.

from primes.

safijari commented on August 26, 2024

I was thinking I could do a shorter video that reviews this code instead. Lemme see, I have other projects to finish first.

…

On Thu, Apr 1, 2021 at 10:37 AM jugge83 ***@***.***> wrote: @safijari <https://github.com/safijari> you should make the video. I can Imagine that it will be popular and the more Python resources that are out there the better. Sadly the cache=True kwarg or an initial "pre-run" of the function didn't improve execution times for me. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/davepl/Primes/issues/25#issuecomment-812062989>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABHTRJHOZ2L32RXZSD6XT5LTGSVMHANCNFSM4Z5CH3XA> .

-- *Jariullah Safi* *Director of AI and Computer Vision* Simbe Robotics <http://www.simberobotics.com/>

from primes.

iJohnnis commented on August 26, 2024

I'm getting the following:

143 ms for Daves C++ code.

298 ms for the above Python code.

You're right, I accidentally posted the results for the 1 billion sieve-size when I run the python code.

The code does not need to be object oriented and my goal was to optimize the python code as much as I could.

Yeah, but if you have to compare the two languages, you have to do similar things.
c++ doesn't need to initialize a new object each time either.

For me, Python development is way faster than C++ development ;)

Yeah, no doubt. Python is, one of the fastest, if not the fastest, languages if you're comparing development time.

from primes.

rhbvkleef commented on August 26, 2024

@iJohnnis

I'm not a python expert but I believe that this is an unfair comparison given that your code does not initialize a new class each time [...

This is true, but I´ve also written an alternative that does initialize a class each time, and the performance is very similar. It is slightly slower, taking off a couple of percent, so it is not significant in any way.

...] and, probably, reuses the same array.

It might, it might not, but that is a matter of page reuse, not a matter of data reuse. It is likely that the same happens in C++ and C#, and is thus, entirely fair.

Also, I think you're using sequences instead of arrays, which means that most of the code does not actually execute until you use the list operation.

This is demonstrably false. I extracted the LLVM IR, and that clearly shows that the code produces the actual sieve representation.

Try executing your code once with a 100 million sieve size (1e8).
My results were:

347 ms for c++ (Dave's code)

413 ms for c# (an optimized version of Dave's code)

4093 ms for python (your code)

My measurements were these:

Language	Iterations (limit=1 million)	Iterations (limit=100 million)
C++ (GCC)	11556	79
C++ (Clang)	11267	83
C# (Mono)	1521	Not measured
C# (DotNet Core)	3317	Not measured
Python (with classes)	10345	32
Python (this implementation)	14845	35

So I´m not entirely sure what to make of it. The python implementation implementation seems to at least be comparable to slightly better for a lower limit, whilst it is significantly slower for higher limit. I am certain that the reason that Python is comparable or faster for lower limits, is simply because it is actually faster, and that the C++ implementation isn´t as good as can be.

Generally going faster than c++ means that something is wrong, given that c/c++ are the lowest level languages

Indeed, something might be wrong. We should better look at whether the C++ implementation is as efficient as can be. We must consider this too, and I would like to point the finger squarely at this.

(Technically C, and especially C++, are not the lowest level languages barring assembly, but I simply wish to point that out, and actually invalidating your point with that would be entirely unfair, and incorrect)

The only way that I see python doing anything faster that c is wrong/inefficient c code.

Indeed, and that seems to be the case. (there are more hypothetical options too, but I don´t think that is the case)

@safijari, @jugge83

Something that I haven't seen considered here is taking the compilation time into account, since the first invocation of run_seive would take a fair bit longer because of Numba compiling it down into LLVM bytecode.

sadly the cache=True kwarg or an initial "pre-run" of the function didn't improve execution times for me.

The reason that we don´t see that, is because by simply adding a type signature in the decorator, a compilation for that type signature is triggered.

from primes.

iJohnnis commented on August 26, 2024

@rhbvkleef if you're interested please read my edited comment. The c# code that I wrote is closer to the c++, the main difference is that I can't initialize the boolean array on true on c# so I used the opposite boolean, a "NoPrimes" array. The code that is written here is closer to the initial version of Dave's code which was a lot more inefficient. Note that I have practically zero experience on Github.

from primes.

rgtoomey commented on August 26, 2024

Dear Dave. I have enjoyed watching your videos. I know some about programming (and am trying to learn more as a hobby). I sat down this weekend, struggled with Anaconda, learned about Github (as linked to your video), and I re-wrote your code (I think) in a way that makes sense to me. I had to look up seive, by the way. Please keep up the videos.

On my 13" Macbook Pro, this code computes the correct number of primes up to 1e9 in about 5 seconds. I tried 1e10, but my laptop decided it was too much. In some of the above posts, users mentioned a Numba package, but this did not really change the speed of my program.

import numpy as np
import math as m
from time import time


def primes(maxNumber):                                      #Function that returns all the prime numbers up to a maximum number   

    start=time()
    
    booleanTestArray = primeSearch(maxNumber)               #Returns a Boolean test of all odd numbers between 2 and the maximum number. A prime number returns true.
    primesArray=getPrimeNumbers(booleanTestArray)           #Converts the Boolean test to an array of prime numbers between 2 and the maximum number
    
    finish=time()  
    
    print(finish-start)     
    return primesArray


def primeSearch(maxNumber) :
    
    possibleTotalPrimes = (maxNumber + 1) // 2              #Sets the total number of potential prime number candidates
    booleanTestArray = np.full(possibleTotalPrimes,True)    #Establishes Boolean test array corresponding to [2,3,5,7, ..., possibleTotalPrimes]
    maxFactor = m.floor(m.sqrt(maxNumber))                  #Largest number potentially factorable into maxNumber
    
    for factor in range(3,maxFactor+1,2) :                  #Test all divisible factors starting from 3 up the maximum
        if booleanTestArray[factor//2] :                    #Check if divisible factor has not been set to false (meaning all multiples of it has already been eliminated as potential primes) 
            booleanTestArray[(factor**2)//2:(maxNumber+1):factor] = False   #If the divisible factor is a prime, set all multiples of the prime number to false in the booleanTestArray
    return booleanTestArray


def getPrimeNumbers(booleanTestArray) :                     #Converts booleanTestArray to array of prime numbers 
    
    primesArray = np.where(booleanTestArray)[0]
    primesArray= 2*primesArray+1 
    primesArray[0]=2
    return primesArray

from primes.

Faster python implementation. about primes HOT 13 CLOSED

Comments (13)

Try executing your code once with a 100 million sieve size (1e8).

My results were:
• 347 ms for c++ (Dave's code)
• 413 ms for c# (an optimized version of Dave's code)
• 493 ms for python (your code)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Comments (13)

Try executing your code once with a 100 million sieve size (1e8).

My results were: • 347 ms for c++ (Dave's code) • 413 ms for c# (an optimized version of Dave's code) • 493 ms for python (your code)

Related Issues (20)

Recommend Projects

Recommend Topics

Recommend Org

My results were:
• 347 ms for c++ (Dave's code)
• 413 ms for c# (an optimized version of Dave's code)
• 493 ms for python (your code)