Do you really think you know strings in Python?
Last weekend, out of curiosity, I set out to explore how strings are implemented in CPython. The results were overwhelming: I realized how little I knew about the different string concepts and optimizations in Python.
What's more fascinating is that a lot of programmers are unaware of these concepts. With these insights, I have an opportunity to baffle a few Python programmers out there.
Okay, I'll begin with a simple snippet that I ran in my IPython interpreter:
>>> a = "wtf"
>>> b = "wtf"
>>> a is b
True
>>> a = "wtf!"
>>> b = "wtf!"
>>> a is b
False
>>> a, b = "wtf!", "wtf!"
>>> a is b
True
Makes sense, right?
Well, if it doesn't, hold on. Everything will feel obvious after a while. But before we get there, let me throw my next snippet at you.
>>> a = "some_string"
>>> id(a)
140420665652016
>>> id("some" + "_" + "string") # Notice that both the ids are same.
140420665652016
Alright, one final attack before we take a break.
>>> 'a' * 20 is 'aaaaaaaaaaaaaaaaaaaa'
True
>>> 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa'
False
None of these outcomes would have made sense to me a couple of weeks ago, and seeing them, I'd have faced an existential crisis as a regular Python programmer.
Anyway, if you have no idea what's going on here: every outcome is a consequence of a concept called "string interning" and a couple of related compile-time optimizations.
Wait, what is this "string interning"?
Strings are immutable objects in Python, so it is safe for multiple variables to reference the same string object. Interning does exactly that: it reuses an existing object instead of creating a new one every time, which saves memory.
Python sometimes implicitly interns strings. The decision of when to implicitly intern a string is implementation dependent. However, some facts can make it easy for us to guess if a string will be interned or not:
- All length 0 and length 1 strings are interned.
- Strings are interned at compile time ('wtf' will be interned, but ''.join(['w', 't', 'f']) will not be interned).
- Strings that are not composed solely of ASCII letters, digits or underscores are not interned. This explains why 'wtf!' was not interned: it contains !.
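Interning can also be requested explicitly. Here is a minimal sketch using sys.intern from the standard library; the helper name build and the sample strings are made up for illustration. Whether the two un-interned strings share an object is a CPython implementation detail, so no result is claimed for that comparison.

```python
import sys


def build(parts):
    """Build a string at runtime so the compiler cannot intern it for us."""
    return "".join(parts)


x = build(["w", "t", "f", "!"])
y = build(["w", "t", "f", "!"])
print(x == y)  # the two strings are equal by value
print(x is y)  # identity here is implementation dependent

# sys.intern returns the canonical copy, so both names end up sharing one object.
x_interned = sys.intern(x)
y_interned = sys.intern(y)
print(x_interned is y_interned)
```

Only the last comparison is guaranteed: once two equal strings have both gone through sys.intern, they are the same object.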
Oh I see, but why was "wtf!" interned in a, b = "wtf!", "wtf!"?
Well, when a and b are set to "wtf!" in the same line, the Python interpreter creates a single object and points both variables at it. If you do it on separate lines, it doesn't "know" that a "wtf!" object already exists (because "wtf!" is not implicitly interned, as per the facts mentioned above).
It's a compiler optimization and specifically applies to the interactive environment. When you enter two lines in a live interpreter, they're compiled separately, and therefore optimized independently. If you were to try this example in a .py file, you would not see the same behavior, because the file is compiled all at once.
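To see the "compiled all at once" effect without creating a file, you can compile both assignments as a single unit yourself. This is a rough sketch using the built-in compile and exec; the "<demo>" filename and the variable names are placeholders, and the sharing of constants it shows is CPython behavior rather than a language guarantee.

```python
# Both assignments live in one code object, as they would in a .py file.
source = 'a = "wtf!"\nb = "wtf!"\n'
code = compile(source, "<demo>", "exec")

namespace = {}
exec(code, namespace)

# CPython stores equal constants of one code object only once,
# so both names are bound to the same string object.
print(code.co_consts)
print(namespace["a"] is namespace["b"])
```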
Ah, but what's with this 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa'?
The expression 'a'*20 is replaced by 'aaaaaaaaaaaaaaaaaaaa' during compilation to save a few clock cycles at runtime. This is known as constant folding. But since the Python bytecode generated by compilation is stored in .pyc files, the peephole optimizer discards folded strings longer than 20 characters (Why? Imagine the size of the .pyc file generated as a result of the expression 'a'*10**10). The exact limit is an implementation detail of the CPython version used here and has changed in later releases.
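If you want to check what your own interpreter folds, one option is to look at the constants stored in the compiled code object. The helper below is a small sketch (folded_at_compile_time is a name I made up); it assumes CPython, where folded values show up in co_consts.

```python
def folded_at_compile_time(expression):
    """Return True if the compiler stored the expression's value as a constant."""
    code = compile(expression, "<demo>", "eval")
    return eval(expression) in code.co_consts


print(folded_at_compile_time("'a' * 20"))
print(folded_at_compile_time("'a' * 21"))  # result depends on the CPython version
```

Keep the expressions small: the helper evaluates them for real, so something like 'a'*10**10 would try to build the whole string.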
Now just go through those snippets again. It all seems obvious now, doesn't it? That's the beauty of Python!
Wait, there's more!!!
Yes, peace won't come easy. Here's another snippet:
>>> print("\\ some string \\")
>>> print(r"\ some string")
>>> print(r"\ some string \")
    File "<stdin>", line 1
      print(r"\ some string \")
                             ^
SyntaxError: EOL while scanning string literal
What happened there?
In a raw string literal, as indicated by the prefix r, the backslash has no special meaning as an escape. What the interpreter actually does, though, is change the behavior of backslashes so that each one passes itself and the following character through unchanged. A backslash right before the closing quote therefore carries the quote along as part of the string and leaves the literal unterminated. That's why a raw string cannot end with a single backslash.
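A quick way to convince yourself is to measure what a raw string actually contains. The snippet below is only an illustration; the variable names are mine, and the workaround at the end (placing a normal literal right after the raw one) is one option among several.

```python
# In a raw string a backslash keeps itself and the next character,
# so r"\"" holds two characters: a backslash and a double quote.
escaped_quote = r"\""
print(len(escaped_quote))

# One way to end with a backslash: put a normal literal after the raw one.
with_trailing = r"\ some string " "\\"
print(with_trailing)
```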
Here's one more...
>>> print('wtfpython''')
wtfpython
>>> print("wtfpython""")
wtfpython
>>> # The following statements raise `SyntaxError`
>>> # print('''wtfpython')
>>> # print("""wtfpython")
Can you guess why print('''wtfpython') or print("""wtfpython") would raise a "SyntaxError" here?
If your answer is no, it's time to introduce a concept called "implicit string concatenation": Python joins adjacent string literals into a single string.
For example:
  >>> print("wtf" "python")
  wtfpython
  >>> print("wtf" "") # or "wtf"""
  wtf
''' and """ are also string delimiters in Python. When a literal starts with one of them, the interpreter keeps scanning for a terminating triple quote and raises a SyntaxError when it finds none. In print('wtfpython''') there is no such problem: the trailing '' is just an empty string concatenated onto 'wtfpython'.
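You can watch the tokenizer make this split with the standard tokenize module. This is a small sketch; string_tokens is a helper name I picked for the example.

```python
import io
import tokenize


def string_tokens(source):
    """Return the text of every STRING token the tokenizer finds in source."""
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.STRING]


# 'wtfpython''' is read as two literals: 'wtfpython' followed by ''.
print(string_tokens("print('wtfpython''')\n"))
print(string_tokens('print("wtf" "python")\n'))
```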
Moar.. (it's the last, pinky promise)
>>> import timeit
# using "+", three strings:
>>> timeit.timeit("s1 = s1 + s2 + s3", setup="s1 = ' ' * 100000; s2 = ' ' * 100000; s3 = ' ' * 100000", number=100)
0.25748300552368164
# using "+=", three strings:
>>> timeit.timeit("s1 += s2 + s3", setup="s1 = ' ' * 100000; s2 = ' ' * 100000; s3 = ' ' * 100000", number=100)
0.012188911437988281
Notice the stark difference in execution times of s1 += s2 + s3 and s1 = s1 + s2 + s3? += is faster than + for concatenating more than two strings because the first string (for example, s1 in s1 += s2 + s3) is not destroyed while calculating the complete string.
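If you'd rather run the comparison as a script, here is a scaled-down sketch of the same measurement. SETUP and time_statement are names invented for this example, the string sizes are smaller so it finishes quickly, and the numbers you get will depend on your machine and Python version.

```python
import timeit

SETUP = "s1 = ' ' * 1000; s2 = ' ' * 1000; s3 = ' ' * 1000"


def time_statement(statement, number=200):
    """Time a statement against fresh strings from SETUP; returns seconds."""
    return timeit.timeit(statement, setup=SETUP, number=number)


plus_time = time_statement("s1 = s1 + s2 + s3")
inplace_time = time_statement("s1 += s2 + s3")
print(f"+  : {plus_time:.6f} s")
print(f"+= : {inplace_time:.6f} s")
```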
Before I conclude, here's a quick fact. Oh wait, a quick realization first: "Strings are collections of characters, much like ourselves." (I know, it's too much now.) Now, coming to the fact: 'a'[0][0][0][0][0] is a valid expression in Python. Why? Because strings are immutable sequences (iterables supporting element access using integer indices) in Python, and indexing a string gives you back another string of length 1. The implementation of the above concepts and optimizations in Python is possible because of this immutability.
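Here's a tiny sketch of both halves of that fact, with variable names chosen just for the example: indexing a string yields a one-character string, and trying to modify a string in place fails.

```python
s = "a"
first = s[0]

# Indexing a str returns a str of length 1, so indexing can be chained forever.
print(type(first).__name__, len(first))
print(s[0][0][0][0][0] == s)

# Immutability: item assignment on a string is rejected.
try:
    s[0] = "b"
except TypeError as exc:
    print("TypeError:", exc)
```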
Alright, here we end. I hope you find this post interesting and informative. If you would like to learn about more such hidden Python gems, I'd recommend you check out What the f**ck Python? which is a curated collection of such subtle and tricky snippets.


Actually, this works in python2
But not in python3 - it will be False
Thanks for the article. Still trying to explain to myself why the code snippet below is the way it is:
In both cases the two objects are created on the same line, so they should be interned. No??
Or is 20 the limit on the length of the string after which interning is not done?
Exactly! I have updated the description though.
I love little intricacies in different languages like this. They’re delightful to learn about
Yeah! I was surprised that I didn't know these things for so long…