Thursday, April 23, 2020

Finding The Mean

If you toss a fair coin a large number of times, you expect the split between heads and tails to be close to 50/50. What I did is generate an array (a list) of random real numbers between 0 and 1 and then find the mean of its values. In Python, that means taking the sum of a list, itself built with a list comprehension that uses the random module to pick random float values, and dividing by the number of values. The idea is to average randomly generated numbers between 0 and 1 as a test of the law of large numbers: I want to end up with a value close to the expected value, which here is 0.5. In my mind, and I'm not a mathematician, I thought of this as a Gaussian distribution. Strictly speaking, the individual values from random.random() are uniformly distributed; it is the average of many of them that is approximately Gaussian, by the central limit theorem. I'm a big fan of all things related to noise, and this experiment makes me think of Gaussian white noise, as a distribution in whatever dimensions.
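
The original code is not reproduced in this text, so here is a minimal sketch of what the paragraph above describes. It assumes the function is a lambda named avg_probs that takes the list size n; the seed value is only there so that repeated runs print the same number.

```python
import random

# Hypothetical reconstruction: build a list of n uniform random floats
# in [0, 1) with a list comprehension, sum it, and divide by n.
avg_probs = lambda n: sum([random.random() for _ in range(n)]) / n

random.seed(2020)  # arbitrary seed, used only to make the run repeatable
print(avg_probs(1_000_000))  # should land very close to 0.5
```

Writing it with def instead of a lambda would behave the same way; the lambda just keeps it looking like a one-line mathematical function.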




I'm not a Python expert, but this is the quickest way I could find to get the mean of an array of n random values. I tried other ways first, but when I refactored the code I got into a functional style of programming and wrote it as a mathematical function. That's why I used a lambda function, because it's an anonymous function, as far as I can tell. Again, I'm not an expert in theoretical computer science either. I just know that I used the timeit module and found this function to be relatively quick. I also tried the statistics module in Python 3.8, and it was much slower.
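
For anyone who wants to repeat that kind of comparison, here is a rough sketch of timing sum-and-divide against statistics.mean with timeit. The list size, the repeat count, and the helper name mean_builtin are my own assumptions for illustration, and the actual timings will depend on the machine and the Python version.

```python
import random
import statistics
import timeit

n = 10_000  # assumed list size, kept small so the script finishes quickly
values = [random.random() for _ in range(n)]

def mean_builtin(xs):
    # Sum with the builtin and divide by the number of values.
    return sum(xs) / len(xs)

# Time 20 calls of each approach on the same list.
t_builtin = timeit.timeit(lambda: mean_builtin(values), number=20)
t_stats = timeit.timeit(lambda: statistics.mean(values), number=20)

print(f"sum(xs) / len(xs): {t_builtin:.4f} s")
print(f"statistics.mean:   {t_stats:.4f} s")
```

Both approaches return essentially the same mean; only the time taken differs.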

I tried doing everything in NumPy, but it was slower than using builtin functions like sum. Instead of importing mean functions from different modules, it was better just to sum the array of size n and divide by n to get the mean, or expected value. Basically, I create an array of size n of what are essentially probabilities: real numbers (floats, in Python) between 0 and 1. With a large n, say 1 million, the mean (average value) usually comes out as 0.50 or 0.4999.

I was really wondering about code optimization in Python. As I said, I'm not an expert in Python or in theoretical computer science, and I'm certainly not a mathematician. I just needed a way to come up with a large number of probabilities in a list (an array, a one-dimensional vector). I found I could do this easily in Python with a list comprehension, i.e. [random.random() for i in range(n)], using the random module. I tried different ways of calculating those random probabilities. The results supported my thesis, though: the mean after a million trials is 0.50, 0.4999, or 0.5001. Anyway, it turns out that one should use builtin functions when possible, because they are usually faster. It's almost like writing the function in C: the loop inside a builtin such as sum runs as compiled C code rather than as interpreted Python bytecode, as far as I understood from my quick overview of the subject.
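
One rough way to see the difference is to count the bytecode instructions Python generates for a hand-written loop versus a call to the builtin sum. This is only a sketch: the function names manual_sum, builtin_sum, and count_instructions are made up for the illustration, the exact counts vary between Python versions, and a lower instruction count is a hint rather than a measurement, so it still needs to be backed up with timing or profiling.

```python
import dis

def manual_sum(xs):
    # Every step of this loop is executed by the interpreter.
    total = 0.0
    for x in xs:
        total += x
    return total

def builtin_sum(xs):
    # The looping happens inside the builtin, which is implemented in C.
    return sum(xs)

def count_instructions(func):
    # Count the bytecode instructions in the function's compiled code.
    return sum(1 for _ in dis.get_instructions(func))

print("manual loop:", count_instructions(manual_sum))
print("builtin sum:", count_instructions(builtin_sum))
```

The manual loop compiles to more instructions, and the loop body is re-executed by the interpreter for every element of the list.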

I was starting to look into the internals of Python to try to see why builtin functions would be faster. This is the best summary I could find so far:

Use Built-in Data Types
This one is pretty obvious. Built-in data types are very fast, especially in comparison to our custom types like trees or linked lists. That’s mainly because the built-ins are implemented in C, which we can’t really match in speed when coding in Python. - Making Python Programs Blazingly Fast

Addendum:
"A simple rule of thumb (but one you must back up using profiling!) is that more lines of bytecode will execute more slowly than fewer equivalent lines of bytecode that use built-in functions." - p.55, High Performance Python: Practical Performant Programming for Humans by Ian Ozsvald and Micha Gorelick
* * *
I did a little more research. After looking into profiling and code optimization, I got to thinking about implementing my function in other programming languages, with the idea that maybe it would be faster in Fortran or C or whatnot. I'm mostly just familiar with Python, but I was able to write the function, in what I think is working code, in both the R language and GNU Octave.


The file is named avg_probs.m with my function in GNU Octave code. Here it is in the R programming language.

Those were a few of the other languages that I was able to "translate" my function into, my "average of probabilities" function, as I am calling it now, or avg_probs().