
I'm curious whether, to save space, you can reassign the parameter and return it rather than creating a whole new string to return. Since the parameter is local to the function's scope, I figure reassigning it shouldn't affect anything, other than possibly making the code a little harder to debug. Or is it good practice, or even a standard, to create a new variable and return that? Thanks.

E.g. creating and returning new_str:

def string_compression(string):
    new_str = ""

    dict = {}
    for char in string:
        if char not in dict:
            dict[char] = 1
        else:
            dict[char] += 1
    for key, value in dict.items():
        new_str += key + str(value)

    return new_str

Versus: (no new_str variable is created or returned)

def string_compression(string):
    dict = {}
    for char in string:
        if char not in dict:
            dict[char] = 1
        else:
            dict[char] += 1

    string = ""
    for key, value in dict.items():
        string += key + str(value)

    return string
Devin B.
  • Please, see the answer here: https://stackoverflow.com/questions/34008010/is-the-time-complexity-of-iterative-string-append-actually-on2-or-on You will see that your time is quadratic and you will learn how to reduce it to linear – Andrey Nov 23 '20 at 22:08
  • The fact that you create a new variable is not really relevant here. In *both* versions, you create *a whole new string* (it would be anyway since you have to build that dictionary) – juanpa.arrivillaga Nov 23 '20 at 22:10
  • @Andrey OP is asking about space complexity, not time complexity, although you raise a good point. – ggorlen Nov 23 '20 at 22:31
  • @Dev please don't call a variable `dict`--it shadows the builtin type. There's no difference between the two code snippets in terms of space _complexity_ (it's still `O(len(string))` either way), although the second is somewhat more optimal in the sense that it may let the garbage collector reclaim the parameter memory. However, this is a micro optimization that is almost entirely pointless to waste time thinking about. Just write clean code. Higher priorities are using `collections.Counter`, `defaultdict`, and `join`, and avoiding the O(n^2) time complexity here. – ggorlen Nov 23 '20 at 22:34
  • How does using defaultdict reduce time complexity though? I've seen it used before, and it just seems to be implementing a function which does the same check behind the scenes, that I'm writing explicitly above when checking whether that key/value pair exists before modifying it somehow. – Devin B. Nov 23 '20 at 23:31
  • @Dev using `collections` doesn't improve the time complexity, although the actual performance will likely be better, since various methods are optimized in C. – juanpa.arrivillaga Nov 24 '20 at 00:09
  • @Dev the complexity will still be *O(n)*, but you should realise that *O(n)* is a fairly permissive requirement. A function that takes less than *n* hours to execute is said to be *O(n)*. A function that takes less than *n* milliseconds to execute is said to be *O(n)*. Using functions from the standard library will usually be faster than writing your own functions. – Stef Nov 24 '20 at 00:35
  • Even if it wasn't faster, it would be preferable because it makes your code easier to read. When I read `count_char = collections.Counter(txt)` I immediately understand that you are counting the occurrences of characters in string `txt`. But if instead, I have to read `count_char = {}; for c in txt: if c not in count_char: do_something else: do_something_else` then I have to read every one of those lines, take some time to understand what they do, then take some extra time to make sure they really do what I got the impression that they do. – Stef Nov 24 '20 at 00:37
  • How does appending to the string with each iteration make the time complexity O(n^2)? It is my understanding that there is no need to resize the string in place every iteration—only when the capacity of that string object in memory requires it to, as it is created in heap rather than the stack. – Devin B. Nov 24 '20 at 00:59

2 Answers


To answer your question

In both code snippets, you are creating new strings. The fact that you give a different name to the string in the first version doesn't make it more or less efficient than the second version.

An important remark about naming a variable dict

The name dict is the name Python uses for the builtin dictionary class. If you use that name for one of your own variables, even if it is a dictionary, you shadow the builtin and are asking for trouble. Avoid using builtin names for your own variables: never call them dict, list, str, sum, or any name in this list: Python Built-in Functions. Instead, you can call your dictionary d, or dictionary, or better yet, something that explains what the dictionary is used for, for instance char_count.
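To make the "trouble" concrete, here is a minimal sketch of what can go wrong. The function name shadowing_demo is made up for this illustration, and the shadowing is kept inside a function so it doesn't leak into the rest of a program:

```python
def shadowing_demo():
    """Hypothetical helper: shows what happens once `dict` is rebound."""
    dict = {}  # from here on, `dict` is this empty dictionary, not the class
    try:
        dict(a=1)  # we meant to call the builtin constructor
    except TypeError as exc:
        return str(exc)
    return None

message = shadowing_demo()
print(message)
```

Outside the function the builtin is untouched, but if the same assignment were made at module level, every later call to dict(...) in that module would fail the same way.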

An important note about string concatenation

Consider the following code snippet:

s = ''
for n in range(1000000):
  s += str(n)

It builds a string by concatenating numbers written in decimal: at the end, s is '01234567891011...999997999998999999'.

But is it efficient? No, it's not. The time complexity for concatenating two strings s1 and s2 using + or += is proportional to the total number of characters in s1 and s2. Here we are concatenating '' with '0', then '0' with '1', then '01' with '2', then '012' with '3', then '0123' with '4', etc. Do you see what is happening? Every step copies everything built so far all over again. This is the story of Schlemiel the painter. The complexity is quadratic instead of linear.

A more efficient way to concatenate more than two strings in Python is to concatenate them all at once using ''.join(...). So instead we should write:

s = ''.join(str(n) for n in range(1000000))
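To put rough numbers on the difference, here is a small sketch that counts characters copied under the cost model described above (each concatenation copies both operands into a new string, while join copies each piece once). The function names naive_copy_cost and join_copy_cost are invented for this illustration, and this is a model of the cost, not a benchmark:

```python
def naive_copy_cost(pieces):
    """Characters copied if every `+=` builds a brand-new string."""
    cost = 0
    length = 0  # length of the string built so far
    for piece in pieces:
        length += len(piece)
        cost += length  # the new string holds the old contents plus the piece
    return cost


def join_copy_cost(pieces):
    """Characters copied if each piece is copied exactly once."""
    return sum(len(piece) for piece in pieces)


pieces = ['x'] * 1000
print(naive_copy_cost(pieces))  # 1 + 2 + ... + 1000 = 500500
print(join_copy_cost(pieces))   # 1000
```

In practice CPython can sometimes extend a string in place for s += t, so measured timings may look better than this model suggests. That is an implementation detail, though, and PEP 8 advises against relying on it.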

Using python module collections

Let's look at this code snippet:

    d = {}
    for char in string:
        if char not in d:
            d[char] = 1
        else:
            d[char] += 1

Guess what? You're not the first one to have come across the need for this logic. If the value is not already in the dictionary, then add it; otherwise, update it. Four lines of code for such a mundane operation! Surely we could automate it a little more? Actually yes, we can. Instead of using a dict, we can use a defaultdict. The defaultdict will behave exactly like a dict would, but it will handle default values for us when needed:

from collections import defaultdict

d = defaultdict(int)
for char in string:
  d[char] += 1 

If d[char] doesn't exist when it's needed by +=, then the defaultdict will create it and give it the value returned by int(), which is 0. Everything works out exactly like before; except we don't have to write the if/else logic ourselves. Cool!
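A quick way to convince yourself of this behaviour is the short sketch below. One detail worth knowing, and one reason some people prefer a plain dict with .get(), is that merely reading a missing key also inserts it:

```python
from collections import defaultdict

default_value = int()  # the factory call defaultdict makes for a missing key

d = defaultdict(int)
for char in 'aab':
    d[char] += 1  # a missing key is first created with the value int()

counts = dict(d)  # plain-dict view of the result: 'a' seen twice, 'b' once

# Reading a key that was never set also creates it with the default value.
value_for_missing = d['z']
missing_key_was_inserted = 'z' in d
```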

...But wait. Why did we need a dictionary in the first place? To count the number of occurrences of the characters in the string. This sounds like a common problem. In fact, it's so common that there is another subclass of dict even more suited than defaultdict for this! It's called Counter and it's also in module collections. You can replace the code above by:

from collections import Counter

d = Counter(string)

and that's it. Yup. No need for a for-loop to fill the dictionary manually. It's all taken care of already.
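If you want to check that Counter really does the same job as the hand-written loop, a sketch like the following compares the two. The helper name count_manually is made up for this comparison:

```python
from collections import Counter


def count_manually(string):
    """Hand-written counting loop, equivalent to the if/else version."""
    counts = {}
    for char in string:
        counts[char] = counts.get(char, 0) + 1
    return counts


text = 'aaaabcbcbb'
char_counts = Counter(text)

same_result = char_counts == count_manually(text)
# Unlike a plain dict, a Counter returns 0 for a key it has never seen,
# and it does so without inserting that key.
count_of_unseen = char_counts['z']
```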

Useful reading

Documentation

StackOverflow question

Final code

from collections import Counter

def string_compression(s):
  char_counts = Counter(s)
  return ''.join('{}{}'.format(c,n) for c,n in char_counts.most_common())

print(string_compression('aaaabcbcbb'))
# a4b4c2

If you want to keep the characters in their order of first appearance rather than sorted by decreasing count, you can replace .most_common() with .items() in the code.
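The input 'aaaabcbcbb' happens to give the same output either way, so here is a sketch with an input where the two orderings differ. The two function names are invented for this comparison, and the .items() variant relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on:

```python
from collections import Counter


def compress_by_count(s):
    """Pairs ordered by decreasing count."""
    return ''.join(f'{c}{n}' for c, n in Counter(s).most_common())


def compress_by_first_appearance(s):
    """Pairs ordered by first appearance in the input."""
    return ''.join(f'{c}{n}' for c, n in Counter(s).items())


print(compress_by_count('abbccc'))             # c3b2a1
print(compress_by_first_appearance('abbccc'))  # a1b2c3
```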

Stef
  • In the case where an interviewer says I can't use the join method to build the string that is ultimately being returned, how would you suggest I build it using the key/value pairs like how I have in the example? Thanks Stef! – Devin B. Nov 23 '20 at 23:40
  • @Dev If I were in that position, I would look the interviewer in the eye, and explain to them, without losing my self-confidence, that Python is the wrong language to refuse using `str.join`. I would offer to write a function in `C` to implement string concatenation manually. And I would point them to the Python documentation page that specifically explains that in Python, we should absolutely use `str.join` to perform string concatenation. – Stef Nov 23 '20 at 23:45
  • How would that function you'd write in C to implement string concatenation work, if it was to avoid O(n^2) from rebuilding the string through each iteration? Thanks in advance! – Devin B. Nov 23 '20 at 23:50
  • In Python, strings are not mutable, so we're stuck with using the functions the standard library offers to manipulate them. In C, by contrast, a string is an array of char and the programmer has full control over that array. You can allocate an array large enough to hold the final result; copy the first string into the beginning of the array; copy the second string into the array after the last character of the first string; etc. The complexity of a single copy operation is proportional to the length of the string being copied. The total complexity is proportional to the total number of characters. – Stef Nov 23 '20 at 23:56
  • @Dev I would point you to CPython's implementation of `str.join`, but it might not be the easiest thing to read :p https://github.com/python/cpython/blob/master/Objects/stringlib/join.h#L110 – Stef Nov 24 '20 at 00:00
  • Yea, that's why I was hoping you knew a way in Python to write it! You have a very eloquent and succinct way of explaining for laymen such as myself :) – Devin B. Nov 24 '20 at 00:02
  • @Dev. Yes. collections.Counter is just a subclass of dict, so you can iterate over it just like you would iterate over a dict, using `.keys()`, `.values()` or `.items()`. In this case, just replace `.most_common()` with `.items()` to get what you want. – Stef Nov 24 '20 at 00:04
  • I had that Aha! moment literally right before you answered. Derp! – Devin B. Nov 24 '20 at 00:05
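
For what it's worth, the allocate-once-and-copy idea described in the comments above for C can be imitated in Python with a mutable bytearray. This is only a sketch for the hypothetical "no join allowed" scenario: the function name concat_preallocated is made up, UTF-8 is an arbitrary choice of encoding, and in real code str.join remains the right tool.

```python
def concat_preallocated(pieces):
    """Concatenate strings by copying each one once into a preallocated buffer."""
    encoded = [piece.encode('utf-8') for piece in pieces]
    total = sum(len(chunk) for chunk in encoded)

    buffer = bytearray(total)  # allocate the full result up front
    position = 0
    for chunk in encoded:
        # Same-length slice assignment overwrites in place, no reallocation.
        buffer[position:position + len(chunk)] = chunk
        position += len(chunk)

    return buffer.decode('utf-8')


print(concat_preallocated(['a', '4', 'b', '4', 'c', '2']))  # a4b4c2
```

Each piece is copied a constant number of times, so the total work is proportional to the total number of characters, as in the C description.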

The important term here is mutable, and no, strings are immutable objects; that is, you can't change their contents (try string[0] = 'a').

Note that reassigning an argument can never affect the calling environment. There is no pass-by-reference (or pointers) as in some other languages. Python uses what is usually called call by sharing (also described as passing object references by value): the function receives a copy of the caller's reference to the object (think address if you're coming from a pointer world), and everything in the function refers to that object. You can mutate the original object if it is mutable, but if you reassign your argument, you simply lose your reference to it.
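A small sketch of the difference between reassigning a parameter and mutating the object it refers to. The function and variable names are invented for this illustration:

```python
def reassign(text):
    text = 'changed'  # rebinds the local name only; the caller is unaffected
    return text


def mutate(items):
    items.append('changed')  # modifies the object the caller also refers to


original = 'unchanged'
returned = reassign(original)
# original is still 'unchanged'; only the return value carries the new string

shared = []
mutate(shared)
# shared is now ['changed'], because the list itself was modified
```

This is why reassigning string inside string_compression is harmless to the caller: it only changes which object the local name points to.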

There is little chance this will affect space complexity in most cases.

kabanus