SolidStart - Hacker News

eesmith 9 months ago

Next try 10 million digits instead of 10 thousand.

The Z-score cannot be interpreted so easily. It must be evaluated over all the different ways people have tried to test the digits for randomness.

"Since the advent of computers, a large number of digits of π have been available on which to perform statistical analysis. Yasumasa Kanada has performed detailed statistical analyses on the decimal digits of π, and found them consistent with normality; for example, the frequencies of the ten digits 0 to 9 were subjected to statistical significance tests, and no evidence of a pattern was found." - https://en.wikipedia.org/wiki/Pi

seccode 9 months ago

I don't think you're correct that the z-score is not being interpreted correctly. I haven't been able to find research demonstrating any non-randomness in pi, so without that we have nothing to compare it to.

eesmith 9 months ago

It's literally an xkcd, #882 https://xkcd.com/882/

There's a large literature of people looking for non-randomness in pi, and failing.

If you try 50,000 different tests with a z-score threshold of 3.29, then about 0.1% of them (50 tests) will give false positives.
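The arithmetic behind that figure can be sketched as follows. This is a minimal sketch that assumes the tests are independent and that 3.29 is used as a two-sided cutoff on a standard normal statistic; the names `two_sided_p`, `n_tests`, and `threshold` are made up for illustration.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided tail probability of a standard normal beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2))

n_tests = 50_000   # number of tests tried, from the comment above
threshold = 3.29   # z-score cutoff, from the comment above

p = two_sided_p(threshold)
expected_false_positives = n_tests * p

print(f"p per test: {p:.5f}")
print(f"expected false positives: {expected_false_positives:.1f}")
```

With a per-test false positive rate of roughly one in a thousand, a few dozen "significant" results out of 50,000 attempts are expected even when nothing is there.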

seccode 8 months ago

Getting higher z-scores now. But you could always just try running the code yourself

thunderbong 9 months ago

Mobile version

https://m.xkcd.com/882/

seccode 9 months ago

I don't have the resources to try this unfortunately. I highly encourage somebody to try. But with more digits you may have to change the model.

eesmith 9 months ago

Yes, you do. You can download the digits at https://www.pilookup.com/download.html . You can generate models for different sized subsets. If it's really non-random, you should see the predictability stabilize as you get larger.

Otherwise you risk being seen as yet another math crank.

seccode 9 months ago

The issue is not with getting the digits, the issue is with running a large model for larger digit ranges. I tried running with 10,000,000 digits and haven't gotten a prediction yet.

seccode 9 months ago

Also, I am testing ranges of digits other than the first 10,000. The problem with the other ranges is that the distribution of digits is highly imbalanced and the model is not showing statistical significance. Models have a harder time when the distribution of classes is not 50/50, so I think it's not quite fair to evaluate the model on these ranges.

So why do you think the first 10,000 digits are somewhat predictable?

eesmith 9 months ago

The distribution of digits is 'highly imbalanced' because that's what random distributions look like. I'll randomly select a digit from 0-9 10,000 times and show the distribution, then do the same with the first 10,000 digits of pi, then do the random distribution again:

  >>> import random
  >>> from collections import Counter
  >>> ctr = Counter(random.choice(range(10)) for i in range(10_000))
  >>> for digit, count in ctr.most_common():
  ...   print(f"{digit}: {count}")
  ...
  2: 1039
  4: 1035
  0: 1031
  7: 1022
  3: 1008
  6: 998
  1: 976
  5: 973
  9: 963
  8: 955
  >>> pi_ctr = Counter(open("1-10000.txt").read().rstrip())
  >>> for digit, count in pi_ctr.most_common():
  ...   print(f"{digit}: {count}")
  ...
  5: 1046
  1: 1026
  2: 1021
  6: 1021
  9: 1014
  4: 1012
  3: 974
  7: 970
  0: 968
  8: 948
  >>> ctr = Counter(random.choice(range(10)) for i in range(10_000))
  >>> for digit, count in ctr.most_common(): print(f"{digit}: {count}")
  ...
  8: 1060
  2: 1048
  0: 1034
  4: 1026
  5: 1025
  3: 979
  7: 977
  6: 960
  1: 956
  9: 935

You can see that the distribution of pi's first 10,000 digits is what one should expect for a random distribution. If your method requires a 50/50 distribution then it cannot be used for this purpose.
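One way to put a number on "what one should expect" is a chi-square goodness-of-fit statistic against a uniform distribution. The sketch below reuses the pi digit counts printed in the session above; the choice of this particular test and the 5% cutoff are assumptions for illustration, not something the thread prescribes.

```python
# Counts for the first 10,000 decimal digits of pi, copied from the session above.
pi_counts = {5: 1046, 1: 1026, 2: 1021, 6: 1021, 9: 1014,
             4: 1012, 3: 974, 7: 970, 0: 968, 8: 948}

total = sum(pi_counts.values())
expected = total / 10  # uniform expectation per digit

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((count - expected) ** 2 / expected for count in pi_counts.values())

# Standard table value for 9 degrees of freedom at the 5% level.
CRITICAL_5_PERCENT_DF9 = 16.919

print(f"chi-square = {chi_square:.3f}")
print("consistent with uniform" if chi_square < CRITICAL_5_PERCENT_DF9
      else "deviates from uniform at the 5% level")
```

The statistic comes out well below the cutoff, which matches the reading that these counts look like an ordinary random sample.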

Also, you are thinking about it wrong. The first 10,000 digits of pi are perfectly predictable.

seccode 9 months ago

I'm not predicting the number; I'm predicting number%2==0. The model predicted better than the distribution probability.

eesmith 9 months ago

It doesn't really matter. There are 4970 even digits and 5030 odd digits in the first 10,000. Predicting all odds gives you a better-than-even chance of being right.
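Those parity totals follow directly from the digit counts posted earlier in the thread. A minimal sketch, where the z-score against a fair coin is an added illustration rather than part of the original comment:

```python
import math

# Counts for the first 10,000 decimal digits of pi, as posted earlier in the thread.
pi_counts = {5: 1046, 1: 1026, 2: 1021, 6: 1021, 9: 1014,
             4: 1012, 3: 974, 7: 970, 0: 968, 8: 948}

total = sum(pi_counts.values())
even = sum(count for digit, count in pi_counts.items() if digit % 2 == 0)
odd = total - even

# Accuracy of the trivial "always guess odd" predictor on this range.
always_odd_accuracy = odd / total

# How unusual is this split for 10,000 fair coin flips?
z = (odd - total / 2) / math.sqrt(total * 0.25)

print(f"even={even}, odd={odd}, always-odd accuracy={always_odd_accuracy:.4f}, z={z:.2f}")
```

A split this close to even is unremarkable for a fair coin, yet it still gives a constant guesser a slight edge over 50%.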

What does "highly unbalanced" mean?

How often will a random sequence be "highly unbalanced"?

How many people used another model, found no pattern, and never reported it?

You have plenty of data to work with. Try the second 10,000, the third 10,000 and so on.
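Checking later blocks could look something like the sketch below. The helper `block_parity_zscores` is hypothetical, and it only measures the odd/even balance of each block against a fair coin; it does not reproduce the classifier under discussion. The demo string is synthetic, not digits of pi.

```python
import math

def block_parity_zscores(digits: str, block_size: int) -> list[float]:
    """z-score of the odd-digit count in each full block, against a fair coin."""
    scores = []
    for start in range(0, len(digits) - block_size + 1, block_size):
        block = digits[start:start + block_size]
        odd = sum(int(ch) % 2 for ch in block)
        scores.append((odd - block_size / 2) / math.sqrt(block_size * 0.25))
    return scores

# Synthetic demo input: an all-odd block, an all-even block, a balanced block.
demo = "1111111111" + "2222222222" + "1212121212"
scores = block_parity_zscores(demo, 10)
print([round(s, 2) for s in scores])
```

With real data one would pass the downloaded digit string and a block size of 10,000, then look at how the per-block scores scatter.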

Keep clear in your mind that a lot of people worked on this problem, including trained mathematicians. It is far more likely that you do not fully understand what you are doing than that they are wrong. Believing otherwise is the path of crankdom.

seccode 9 months ago

Better to use statistical significance tests to talk about what is "far more likely"

seccode 9 months ago

It doesn't predict better than even, it predicts better than the distribution probability

lifthrasiir 9 months ago

Because "somewhat predictable" doesn't mean "non-random". In fact, almost all prefixes of algorithmically random bit sequences are somewhat predictable with an appropriate definition of "somewhat", because you can find and exploit an accidental bias from taking the prefix and any such bias translates to some predictability.
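A small simulation illustrates the accidental-bias point. It is a sketch under assumptions: Python's pseudo-random generator stands in for a random source, and the "exploit" is simply guessing whichever bit value turned out to be the majority in the prefix, chosen after looking at it.

```python
import random

def majority_guess_accuracy(bits: list[int]) -> float:
    """Accuracy of always guessing the more common bit value in this prefix."""
    ones = sum(bits)
    return max(ones, len(bits) - ones) / len(bits)

rng = random.Random(0)  # fixed seed so the run is repeatable
prefix_length = 10_000
trials = 200

accuracies = []
for _ in range(trials):
    bits = [rng.getrandbits(1) for _ in range(prefix_length)]
    accuracies.append(majority_guess_accuracy(bits))

mean_accuracy = sum(accuracies) / len(accuracies)
print(f"min accuracy: {min(accuracies):.4f}, mean accuracy: {mean_accuracy:.4f}")
```

Every prefix yields an accuracy of at least 50%, and typically a bit more, even though the bits carry no structure at all.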

(Another possibility is that, since pi itself is not really algorithmically random, the classifier was somehow able to learn how to partially compute pi! That's another pitfall you need to avoid even when you have a good understanding of information theory and statistics...)

lifthrasiir 9 months ago

A z-score is not a direct measure of statistical significance; it needs a correct interpretation first. If my measurement of a normally distributed random variable resulted in a z-score of 5.0, that can still occur by slim (<0.000058%) but pure chance. If we don't know the variable's distribution in advance, the chance can be as high as 4% per Chebyshev's inequality. You haven't supplied the interpretation needed to exclude such possibilities and derive your conclusion.
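Both figures can be reproduced in a few lines. This sketch assumes the normal figure is a two-sided tail probability and uses the standard form of Chebyshev's inequality, P(|X - mean| >= k * sd) <= 1/k^2.

```python
import math

z = 5.0

# Two-sided tail probability if the variable really is normally distributed.
normal_tail = math.erfc(z / math.sqrt(2))

# Distribution-free upper bound from Chebyshev's inequality.
chebyshev_bound = 1 / z ** 2

print(f"normal: {normal_tail * 100:.7f}%  Chebyshev bound: {chebyshev_bound * 100:.1f}%")
```

The gap between the two numbers is the point: the same z-score means very different things depending on what is known about the distribution.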

The only thing one can conclude from your claim is that the first handful of digits of pi can be very, very slightly compressed if you are somehow able to ignore the size of that classifier [1]. Algorithmic randomness, however, requires the classifier to be included, and there is no known self-contained program that is smaller than printing all the digits verbatim or computing them in the first place.

[1] By the way, this is not really surprising at all. In fact, 5,001 (not 5,000) of the first 10,000 decimal digits of pi are between 5 and 9, so they can be very, very slightly compressed in that way (not counting the size of the decompressor, of course).

seccode 9 months ago

Thanks for teaching me some important statistics!

tgv 9 months ago

One obvious problem is the model. There is a model that can predict each digit with 100% accuracy, so you must show in advance that your model will never have a bias, in all instances, not a single run. Even then, the tiniest error in the randomization procedure can have an effect. Perhaps you've only shown that your particular implementation of random forests is biased.

The second problem is the interpretation of statistical significance. You don't even explain how you compute it, and it's not in the code. For starters, what's your df?

Another problem is the sample. It might be possible that any range of 10,000 digits isn't random (according to your criterion), but the whole is. Now that I think about it: it is very likely there is a structure, since it's a rather limited string, and the number of possible bit strings of length n quickly outnumbers the total length of the string for increasing n. 10k bits can only contain the unique strings up to length 10, and those wouldn't be random at all (since you can predict the remaining bits with increasing accuracy). So the details of your prediction model are really essential.

seccode 8 months ago

I used the z-score. How can you claim that the digits of pi are random when a random forest classifier predicted better than the distribution probability? Your claim implicitly means "there is no structure." The hard thing to understand is that the classifier didn't see the test set, so what structure did it learn? At the very least, this is an interesting question.

seccode 9 months ago

Worth noting that I showed _statistical significance_, not _proof_

pentaphobe 9 months ago

Not really - you showed an _observation_ about the distribution of decimal digits in a very limited sample of a long (likely infinite) set, but then confidently presented a conclusion about the whole set drawn from this.

If you'll forgive the reductio ad absurdum: I could toss a coin once and post a misleading write-up about how coin tosses land on heads 100% of the time. :)

Also notable: the description of your repo (at time of writing) is literally "Statistically significant proof that the digits of pi are not random". The word "proof" there does seem to contradict what you're saying in this message.

Not bullying by the way - I'm well familiar with how exciting it can be to find something cool. (And kudos for sharing)

But if you use terms like "statistically significant proof" then you're making claims / misleading statements (or just being clickbaity) rather than providing objective observations and _separate_ subjective hypotheses.

(IMO) a far more palatable alternative would be "I tried analysing the first N digits of pi with a statistical algorithm and found they're not evenly distributed." (i.e. just the facts, no claims) or, if you want to retain the clickbait, "I used statistics to analyse the first N digits of pi... you won't believe the 9,000th digit!"

As others have mentioned here: of course it's not random. We can predict the next digit with 100% accuracy. But even meeting you halfway and choosing to interpret your use of "random" as meaning "evenly distributed", it's a big reach.

seccode 9 months ago

Thanks, this is a good point. I'll change "proof" to "evidence".

pentaphobe 9 months ago

Thanks for the very reasonable response to my diatribe hehe

jqpabc123 9 months ago

Pi = circumference / diameter

Pi is irrational but definitely not random.

The digits of pi are not random