Do you really think you know strings in Python?

Published Jan 12, 2018. Last updated Jul 10, 2018.

Last weekend, out of curiosity, I set out to explore how strings are implemented in CPython. The results were overwhelming: I realized how little I knew about the different string concepts and optimizations in Python.

What's more fascinating is that a lot of programmers are unaware of these concepts. With these insights, I have an opportunity to baffle a few Python programmers out there.

Okay, I'll begin with a simple snippet that I ran in my IPython interpreter:

>>> a = "wtf"
>>> b = "wtf"
>>> a is b
True

>>> a = "wtf!"
>>> b = "wtf!"
>>> a is b
False

>>> a, b = "wtf!", "wtf!"
>>> a is b
True

Makes sense, right?

Well, if it doesn't, hold on. Everything will feel obvious in a while. But before we get there, let me throw my next snippet at you.

>>> a = "some_string"
>>> id(a)
140420665652016
>>> id("some" + "_" + "string") # Notice that both ids are the same.
140420665652016

Alright, one final attack before we take a break.

>>> 'a' * 20 is 'aaaaaaaaaaaaaaaaaaaa'
True
>>> 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa'
False

None of these outcomes would have made sense to me a couple of weeks ago, and seeing them, I'd have faced an existential crisis as a regular Python programmer.

Anyway, if you have no idea what's going on here, every one of these outcomes is just a consequence of a concept called "string interning".

Wait, what is this "string interning"?

Strings are immutable objects in Python, so it is safe for multiple variables to reference the same string object. Interning takes advantage of this to save memory: rather than creating a new object every time, Python reuses an existing one.

Python sometimes implicitly interns strings. The decision of when to implicitly intern a string is implementation dependent. However, some facts can make it easy for us to guess if a string will be interned or not:

  • All length 0 and length 1 strings are interned.
  • Strings are interned at compile time ('wtf' will be interned, but ''.join(['w', 't', 'f']) will not be interned).
  • Strings that are not composed of ASCII letters, digits or underscores are not interned. This explains why 'wtf!' was not interned: it contains a !.
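
To check these rules by hand, the standard library offers sys.intern, which interns a string explicitly. The snippet below is a minimal sketch, assuming CPython; the variable names are only illustrative. It builds "wtf!" twice at runtime, so the compiler cannot share a single constant, and then interns both copies.

```python
import sys

# Build two equal strings at runtime, so the compiler cannot share one constant.
pieces = ["wtf", "!"]
a = "".join(pieces)
b = "".join(pieces)

print(a == b)  # the values are equal
print(a is b)  # identity is an implementation detail, so no output is promised here

# sys.intern() returns the canonical copy held in the interpreter's intern table.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)  # True
```

Whatever the second print shows, the two interned references always point to one object, which is exactly the memory saving described above.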

Oh I see, but why was "wtf!" interned in a, b = "wtf!", "wtf!"?

Well, when a and b are set to "wtf!" in the same line, the Python interpreter creates one new object and then points the second variable at that same object. If you do it on separate lines, it doesn't "know" that there's already a "wtf!" object (because "wtf!" is not implicitly interned, as per the facts mentioned above).

It's a compiler optimization and specifically applies to the interactive environment. When you enter two lines in a live interpreter, they're compiled separately, and therefore optimized independently. If you were to try this example in a .py file, you would not see the same behavior, because the file is compiled all at once.
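
The difference between "compiled together" and "compiled separately" can be imitated with the built-in compile and exec. This is only a sketch, assuming CPython: the first half treats both assignments as one unit, the way a .py file is handled, and the second half compiles each line on its own, the way an interactive prompt does. The names file_like and repl_like are made up for the illustration.

```python
SOURCE = 'a = "wtf!"\nb = "wtf!"\n'

# One compilation unit, like a .py file: both assignments live in one code object.
file_like = {}
exec(compile(SOURCE, "<one_unit>", "exec"), file_like)
same_unit = file_like["a"] is file_like["b"]

# Two compilation units, like two separate lines typed at the prompt.
repl_like = {}
for line in SOURCE.splitlines():
    exec(compile(line, "<separate_unit>", "exec"), repl_like)
separate_units = repl_like["a"] is repl_like["b"]

print("compiled together:  ", same_unit)
print("compiled separately:", separate_units)
```

In CPython the first result is True, because equal constants inside one code object are merged. The second result depends on the interpreter version, so treat it as something to observe rather than rely on.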

Ah, but what's with this 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa'?

The expression 'a'*20 is replaced by 'aaaaaaaaaaaaaaaaaaaa' during compilation to save a few clock cycles at runtime. This is constant folding, one of the peephole optimizations. But since the Python bytecode generated by compilation is stored in .pyc files, results longer than 20 characters are not folded in the CPython version used here. (Why? Imagine the size of the .pyc file generated as a result of the expression 'a'*10**10.)
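
One way to see whether an expression was folded is to look at the constants stored in the compiled code object. The helper below is a rough sketch, assuming CPython; is_folded is a name invented for this example, not a standard function. The length threshold differs between CPython versions, so the output for the longer strings is deliberately not stated.

```python
def is_folded(expr):
    """Return True if the compiled code stores the expression's value as a constant."""
    code = compile(expr, "<expr>", "eval")
    return eval(expr) in code.co_consts


for expr in ("'a' * 20", "'a' * 21", "'a' * 10**6"):
    verdict = "folded at compile time" if is_folded(expr) else "computed at runtime"
    print(expr, "->", verdict)
```

If an expression is folded, its full value sits in co_consts and would be written to the .pyc file, which is why the compiler refuses to fold large results.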

Now just go through those snippets again. It all seems obvious now, doesn't it? That's the beauty of Python! ❤

Wait, there's more!!!

Yes, peace won't come easy. Here's another snippet:

>>> print("\\ some string \\")
>>> print(r"\ some string")
>>> print(r"\ some string \")

    File "<stdin>", line 1
      print(r"\ some string \")
                             ^
SyntaxError: EOL while scanning string literal

What happened there?

Actually, in a raw string literal, as indicated by the prefix r, the backslash doesn't have its special meaning. What the interpreter actually does, though, is simply change the behavior of backslashes so that they pass themselves and the following character through unchanged. That's why a raw string cannot end with a single backslash: the backslash carries the closing quote along with it, and the literal is left unterminated.
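
The "passes itself and the following character through" rule is easy to confirm with a raw string that contains a backslash followed by a quote. The snippet is a small sketch; the last part shows one possible workaround for a trailing backslash (gluing an ordinary escaped backslash onto the raw literal), not the only one.

```python
# In a raw literal, backslash + quote stays in the string as two characters.
quoted = r"\""
print(len(quoted), quoted)  # 2 \"

# One way to end a string with a backslash: append an ordinary escaped one.
trailing = r"\ some string " "\\"
print(trailing)  # \ some string \
```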

Here's one more...

>>> print('wtfpython''')
wtfpython
>>> print("wtfpython""")
wtfpython
>>> # The following statements raise `SyntaxError`
>>> # print('''wtfpython')
>>> # print("""wtfpython")

Can you guess why print('''wtfpython') or print("""wtfpython") would raise a "SyntaxError" here?

If your answer is no, it's time to introduce a new concept called "implicit string concatenation": Python joins adjacent string literals into a single string.

For example:

  >>> print("wtf" "python")
  wtfpython
  >>> print("wtf" "") # or "wtf"""
  wtf

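''' and """ are also string delimiters in Python. In print('''wtfpython'), the interpreter therefore reads ''' as the start of a triple-quoted string literal and raises a SyntaxError, because it reaches the end of the input while still expecting the terminating triple quote. In print('wtfpython'''), on the other hand, 'wtfpython' is already a complete literal, and the trailing '' is an empty string that gets implicitly concatenated to it.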
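
All four variants can be tried in one go by evaluating the literals from source text and catching the error. This is just a sketch; try_literal is a hypothetical helper written for this post.

```python
def try_literal(source):
    """Evaluate a string literal given as source text; report a SyntaxError instead of raising."""
    try:
        return repr(eval(source))
    except SyntaxError:
        return "SyntaxError"


for source in ("'wtfpython'''", '"wtfpython"""', "'''wtfpython'", '"""wtfpython"'):
    print(source, "->", try_literal(source))
```

The first two evaluate to 'wtfpython', and the last two are reported as SyntaxError because the triple-quoted literal is never closed.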

Moar.. (it's the last, pinky promise)

>>> import timeit
# using "+", three strings:
>>> timeit.timeit("s1 = s1 + s2 + s3", setup="s1 = ' ' * 100000; s2 = ' ' * 100000; s3 = ' ' * 100000", number=100)
0.25748300552368164
# using "+=", three strings:
>>> timeit.timeit("s1 += s2 + s3", setup="s1 = ' ' * 100000; s2 = ' ' * 100000; s3 = ' ' * 100000", number=100)
0.012188911437988281

Notice the stark difference in execution times of s1 += s2 + s3 and s1 = s1 + s2 + s3? += is actually faster than + for concatenating more than two strings because the first string (for example, s1 in s1 += s2 + s3) is not destroyed while the complete string is being calculated.
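
If you want to repeat the measurement as a script rather than at the prompt, something like the following should do. It is a sketch with smaller strings and fewer runs than above so that it finishes quickly; the absolute numbers depend on your machine and Python version, so none are quoted here.

```python
import timeit

# Smaller strings than in the snippet above, so the script finishes quickly.
SETUP = "s1 = ' ' * 10000; s2 = ' ' * 10000; s3 = ' ' * 10000"

timings = {
    stmt: timeit.timeit(stmt, setup=SETUP, number=100)
    for stmt in ("s1 = s1 + s2 + s3", "s1 += s2 + s3")
}

for stmt, seconds in timings.items():
    print(f"{stmt:<20} {seconds:.6f} s")
```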

Before I conclude, here's a quick fact. Oh wait, a quick realization first: "Strings are collections of characters, much like ourselves." (I know, it's too much now.) Now, coming to the fact: 'a'[0][0][0][0][0] is a semantically valid expression in Python. Why? Because strings are immutable sequences (iterables supporting element access using integer indices) in Python, and indexing a string gives back another string of length one, which can be indexed again. The implementation of the above concepts and optimizations in Python is possible due to this very immutability.
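
Both halves of that fact, "sequence" and "immutable", can be checked in a few lines. A minimal sketch:

```python
s = "wtf"

# Indexing a string returns another string, of length one.
first = s[0]
print(type(first).__name__, len(first))  # str 1

# So the result can be indexed again, as many times as you like.
print("a"[0][0][0][0][0] == "a")  # True

# Strings are immutable: item assignment is rejected.
try:
    s[0] = "W"
except TypeError as exc:
    print("TypeError:", exc)
```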

Alright, here we end. I hope you find this post interesting and informative. If you would like to learn about more such hidden Python gems, I'd recommend you check out What the f**ck Python? which is a curated collection of such subtle and tricky snippets.

Discover and read more posts from Satwik Kansal
11 Replies
Дарья Корсакова
7 years ago

Actually, this works in python2

>>> a, b = "wtf!", "wtf!"
>>> a is b
True

But not in python3 - it will be False

ajit
7 years ago

Thanks for the article. Still trying to explain myself why below code snippet is the way it is:

'a' * 20 is 'aaaaaaaaaaaaaaaaaaaa'
True

'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa'
False

Both cases the two objects are created on same line, so should be interned. No??

ajit
7 years ago

Or is 20 the limit on the length of the string after which interning is not done?

Satwik Kansal
7 years ago

Exactly! I have updated the description though.

Adam French
8 years ago

I love little intricacies in different languages like this. They’re delightful to learn about

Satwik Kansal
8 years ago

Yeah! I was surprised that I didn't know these things for so long…
