“Facebook parent company Meta Platforms recently claimed the largest version of their upcoming Llama 3 model — which has not yet been released — has been trained on up to 15 trillion tokens, each of which can represent a piece of a word.”
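For concreteness, here is what a “token” actually is in these systems. Below is a toy greedy tokenizer — a minimal sketch, not Meta’s code: the vocabulary is invented for illustration, and Llama’s real tokenizer is a learned byte-pair-encoding model with a vocabulary of tens of thousands of fragments — but the principle is the same: words get chopped into statistically common chunks.

```python
# Toy greedy longest-match tokenizer. VOCAB is a hypothetical,
# hand-picked fragment set; real tokenizers learn their fragments
# from corpus frequency statistics (byte-pair encoding).

VOCAB = {"un", "believ", "ably", "token", "iz", "ation"}

def tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split `word` into fragments from `vocab`, longest match first."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest fragment first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:                              # no fragment matched here:
            tokens.append(word[i])         # fall back to a single character
            i += 1
    return tokens

print(tokenize("unbelievably", VOCAB))   # ['un', 'believ', 'ably']
print(tokenize("tokenization", VOCAB))   # ['token', 'iz', 'ation']
```

Note that nothing in this process knows or cares what “unbelievably” means; the fragments are chosen by frequency statistics, not semantics.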
This is not the way language works! This is the doctrine of “logical atomism,” the brainchild of Bertrand Russell and his posse, which was roundly discredited in the 1940s (even though Russell himself continued to believe it, much like a Japanese soldier on a remote island refusing to accept the end of WWII).
They’re using a model of language that is essentially a laughingstock in the 21st century.
If this model were correct, then *people* would “run out of training data” too, but they don’t, because that’s not a thing: a child becomes fluent on orders of magnitude fewer words than these models ingest.
Part of Wittgenstein’s refutation of logical atomism was the “rule-following paradox”: you can’t say that people learn or use language by following rules, because every rule, to be made explicit, needs a meta-rule telling you how to apply it, and that meta-rule needs a meta-meta-rule, and so on in an infinite regress.
The AI people are running up against this in real life (and vindicating Wittgenstein). They have to employ armies of third-world wage slaves to, e.g., tag images: “this is a shoe, this is a coat,” and so on. They have rule books for when to call something this or that, but those rule books have grown insanely bloated and complex, demonstrating in practice the infinite regress of the rule-based approach.
It will never work.