Protecting a Python codebase

The very nature of Python makes protecting source code complicated. Because Python is an interpreted language, the code must be available in some form for the interpreter to execute it.

In this article, I’ll walk through the steps I followed while looking for a suitable way to protect a Python-based codebase.

The techniques explored here are implemented in the companion repository, along with a few more attempts that will be discussed in a second article. The goal of that project was to protect a simple Flask application with NumPy as a dependency.

Note: I’ve left obfuscation out of the equation on purpose; it’s a well-known mechanism. Python code obfuscation is an extensive topic in and of itself that we might cover in another post.

Distributing bytecode

Taking baby steps, the first thing to attempt is to distribute byte-compiled modules: the usual .pyc files that the Python interpreter creates for performance reasons. The code in them doesn’t run any faster, but it loads more quickly.

The idea is to compile every module and distribute the resulting .pyc files instead of the traditional .py files. Python comes with the compileall module, which processes all the .py files in a directory tree. The invocation is quite simple:

$ python -m compileall .
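The same step can be driven from Python. The sketch below is written for Python 3 (an assumption, since the examples in this article target Python 2.7) and uses compileall.compile_dir with legacy=True so that each .pyc lands next to its source rather than under __pycache__. The compile_tree helper and its remove_sources option are illustrative names, not something taken from the companion repository.

```python
import compileall
import os


def compile_tree(root, remove_sources=False):
    """Byte-compile every .py file under root and return the .pyc paths."""
    # legacy=True writes foo.pyc beside foo.py (the pre-__pycache__ layout),
    # so the compiled file can still be imported once the source is gone.
    ok = compileall.compile_dir(root, quiet=1, legacy=True, force=True)
    if not ok:
        raise RuntimeError("some modules failed to compile")

    compiled = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".py"):
                source = os.path.join(dirpath, filename)
                compiled.append(source + "c")
                if remove_sources:
                    os.remove(source)
    return sorted(compiled)
```

Calling compile_tree('.', remove_sources=True) on a copy of the project would leave only the .pyc files behind, which is the layout this section is evaluating.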

A .pyc file, as written by Python 2, is a simple binary file containing:

A magic number (four bytes)
A timestamp (four bytes)
A code object (marshalled code)

It’s fair to think that this is a secure mechanism. It’s a binary file after all, and inexperienced eyes will just see garbage when opening it with a text editor. Sadly, transforming it back into the original code is far from impossible.

Take for instance this really simple example stored in a hello.py module:

def hello_world(name):
    print('Hello {0}'.format(name))

The byte-compiled module looks like this:

0000000: 03f3 0d0a 97b0 fd54 6300 0000 0000 0000  .......Tc.......
0000010: 0001 0000 0040 0000 0073 0d00 0000 6400  .....@...s....d.
0000020: 0084 0000 5a00 0064 0100 5328 0200 0000  ....Z..d..S(....
0000030: 6301 0000 0001 0000 0002 0000 0043 0000  c............C..
0000040: 0073 1200 0000 6401 006a 0000 7c00 0083  .s....d..j..|...
0000050: 0100 4748 6400 0053 2802 0000 004e 7309  ..GHd..S(....Ns.
0000060: 0000 0066 6f6f 202d 207b 307d 2801 0000  ...foo - {0}(...
0000070: 0074 0600 0000 666f 726d 6174 2801 0000  .t....format(...
0000080: 0074 0100 0000 6128 0000 0000 2800 0000  .t....a(....(...
0000090: 0073 0800 0000 6865 6c6c 6f2e 7079 7403  .s....hello.pyt.
00000a0: 0000 0066 6f6f 0100 0000 7302 0000 0000  ...foo....s.....
00000b0: 014e 2801 0000 0052 0200 0000 2800 0000  .N(....R....(...
00000c0: 0028 0000 0000 2800 0000 0073 0800 0000  .(....(....s....
00000d0: 6865 6c6c 6f2e 7079 7408 0000 003c 6d6f  hello.pyt....<mo
00000e0: 6475 6c65 3e01 0000 0073 0000 0000       dule>....s....

Not much to the naked eye, but it’s easy to inspect using the included batteries:

>>> import dis
>>> import marshal
>>> import struct
>>> import imp
>>>
>>> with open('hello.pyc', 'rb') as f:  # Read the file in binary mode
...     magic = f.read(4)
...     timestamp = f.read(4)
...     code = f.read()
...
>>>
>>> # Unpack the structure content and un-marshal the code
>>> magic = struct.unpack('<H', magic[:2])
>>> timestamp = struct.unpack('<I', timestamp)
>>> code = marshal.loads(code)
>>> magic, timestamp, code
((62211,), (1425911959,), <code object <module> at 0x7fd54f90d5b0, file "hello.py", line 1>)
>>>
>>> # Verify if magic number corresponds with the current python version
>>> struct.unpack('<H', imp.get_magic()[:2]) == magic
True
>>>
>>> # Disassemble the code object
>>> dis.disassemble(code)
  1           0 LOAD_CONST               0 (<code object hello_world at 0x7f31b7240eb0, file "hello.py", line 1>)
              3 MAKE_FUNCTION            0
              6 STORE_NAME               0 (hello_world)
              9 LOAD_CONST               1 (None)
             12 RETURN_VALUE
>>>
>>> # Also disassemble that const being loaded (our function)
>>> dis.disassemble(code.co_consts[0])
  2           0 LOAD_CONST               1 ('Hello {0}')
              3 LOAD_ATTR                0 (format)
              6 LOAD_FAST                0 (name)
              9 CALL_FUNCTION            1
             12 PRINT_ITEM
             13 PRINT_NEWLINE
             14 LOAD_CONST               0 (None)
             17 RETURN_VALUE
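The session above runs on Python 2. As a hedged sketch for Python 3.7 and later (an assumption about the reader's interpreter), the header is 16 bytes rather than 8: the magic number, a flags field and then, for timestamp-based files, the source modification time and the source size. The helper names read_pyc and disassemble_pyc are made up for this illustration.

```python
import dis
import importlib.util
import marshal
import struct


def read_pyc(path):
    """Return (magic, flags, mtime, source_size, code) for a Python 3.7+ .pyc.

    The mtime/source_size interpretation only holds for timestamp-based
    files (flags == 0); hash-based files store a source hash there instead.
    """
    with open(path, "rb") as f:
        header = f.read(16)
        payload = f.read()
    magic = header[:4]
    # Three little-endian 32-bit fields follow the 4-byte magic number.
    flags, mtime, source_size = struct.unpack("<III", header[4:])
    if magic != importlib.util.MAGIC_NUMBER:
        raise ValueError("bytecode was written by a different Python version")
    return magic, flags, mtime, source_size, marshal.loads(payload)


def disassemble_pyc(path):
    """Print the disassembly of the module-level code object in a .pyc."""
    code = read_pyc(path)[-1]
    dis.dis(code)
```

The point is the same as in the Python 2 session: nothing beyond the standard library is needed to get from a .pyc back to readable op-codes.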

It’s not hard to get a notion of what the code does, and tools can be written that translate the op-codes back into human-friendly source. In fact, the uncompyle2 package already does that.

Result: distribution of .pyc files is not a valid mechanism of protection.

Optimized bytecode

Python has the concept of optimized bytecode. Still taking baby steps, we could distribute optimized code, which might help to protect the original codebase.

However, Python’s optimizations aren’t real code optimizations; they are mere simplifications. There are two levels of “optimization”:

Level one (-O flag): discards assert statements from the code.
Level two (-OO flag): applies the level-one optimization and also discards docstrings from the code.
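A quick way to see both levels in action is the optimize argument of the built-in compile(), which on Python 3 mirrors the -O and -OO flags. The sketch below assumes Python 3; the divide function and the build helper are invented for the demonstration.

```python
SOURCE = '''
def divide(a, b):
    """Divide a by b."""
    assert b != 0, "b must not be zero"
    return a / b
'''


def build(optimize):
    """Compile SOURCE at the given optimization level and return divide()."""
    namespace = {}
    code = compile(SOURCE, "<demo>", "exec", optimize=optimize)
    exec(code, namespace)
    return namespace["divide"]


plain = build(0)      # like running without flags
level_one = build(1)  # like -O: assert statements are dropped
level_two = build(2)  # like -OO: docstrings are dropped as well
```

With these three functions, plain(1, 0) fails on the assert, level_one(1, 0) reaches the division and raises ZeroDivisionError, and level_two.__doc__ is None. The logic of the function is otherwise untouched, which is why this adds nothing in terms of protection.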

So we are back to square one, this time with just a smaller version of .pyc files.

Result: distribution of .pyo files is not a valid mechanism of protection.

Extending the import process

Python supports customization of the import process by means of import hooks. The interpreter itself uses this feature to implement the import of modules from zip archives.

This concept can be used to import modules from different container types or to pre-process modules before loading them. Take for instance the following finder and loader implementation:

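import os
import imp
import sys

# AESCypher, SIGNATURE and SECRET are not defined in this snippet;
# they are expected to be provided elsewhere in the project.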


class BaseLoader(object):
    def modinfo(self, name, path):
        try:
            modinfo = imp.find_module(name.rsplit('.', 1)[-1], path)
        except ImportError:
            if '.' not in name:
                raise
            clean_path, clean_name = name.rsplit('.', 1)
            clean_path = [clean_path.replace('.', '/')]
            modinfo = imp.find_module(clean_name, clean_path)

        file, pathname, (suffix, mode, type_) = modinfo
        if type_ == imp.PY_SOURCE:
            filename = pathname
        elif type_ == imp.PY_COMPILED:
            filename = pathname[:-1]
        elif type_ == imp.PKG_DIRECTORY:
            filename = os.path.join(pathname, '__init__.py')
        else:
            return (None, None)
        return (filename, modinfo)


class Loader(BaseLoader):
    def __init__(self, name, path):
        self.name = name
        self.path = path

    def load_module(self, name):
        if name in sys.modules:
            return sys.modules[name]

        filename, modinfo = self.modinfo(self.name, self.path)
        if filename is None and modinfo is None:
            return None

        file = modinfo[0] or open(filename, 'r')
        src = ''.join(file.readlines())
        src = self.decrypt(src[len(SIGNATURE):])

        module = imp.new_module(name)
        module.__file__ = filename
        module.__path__ = [os.path.dirname(os.path.abspath(file.name))]
        module.__loader__ = self
        sys.modules[name] = module
        exec(src, module.__dict__)
        print "encrypted module loaded: {0}".format(name)
        return module

    def decrypt(self, input):
        return AESCypher(SECRET).decrypt(input)


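class Finder(BaseLoader):
    def find_module(self, name, path=None):
        filename, modinfo = self.modinfo(name, path)
        if filename is None and modinfo is None:
            return None

        # Only claim modules whose file starts with the signature; returning
        # None lets the regular import machinery handle everything else.
        file = modinfo[0] or open(filename, 'r')
        if file.read(len(SIGNATURE)) == SIGNATURE:
            print "encrypted module found: {0} (at {1})".format(name, path)
            file.seek(0)
            return Loader(name, path)
        file.close()
        return None


# Register the hook so that it is consulted on every import
sys.meta_path.append(Finder())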

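Note: AESCypher, along with the SIGNATURE and SECRET constants used above, is defined outside this snippet (see the companion repository).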

The loader above is a simple import hook registered in sys.meta_path. It intercepts imports of encrypted modules, decrypts them and returns them.

The process is quite simple. Python calls find_module, which in our case checks whether the given path and name correspond to an encrypted module. If they do, an instance of Loader is returned, which takes care of reading and decrypting the source, then creating and returning the module instance.
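The code above relies on the imp module and the find_module/load_module protocol from Python 2. As a rough sketch of the same flow on Python 3 (an assumption about the target interpreter), the finder returns a module spec and the loader implements exec_module. To keep the example self-contained, base64 stands in for AES here; it is not encryption at all. The ENC1: signature, the .pye extension and all class and function names are hypothetical.

```python
import base64
import importlib
import importlib.abc
import importlib.util
import os
import sys
import tempfile

SIGNATURE = b"ENC1:"  # hypothetical marker at the start of protected files
SUFFIX = ".pye"       # hypothetical extension for protected modules


def encode(source):
    """Stand-in for encryption. base64 is NOT a cipher; use real AES here."""
    return SIGNATURE + base64.b64encode(source)


def decode(blob):
    """Reverse encode(): drop the signature and recover the source bytes."""
    return base64.b64decode(blob[len(SIGNATURE):])


class EncodedLoader(importlib.abc.Loader):
    def __init__(self, filename):
        self.filename = filename

    def exec_module(self, module):
        # Read, decode and execute the source inside the new module.
        with open(self.filename, "rb") as f:
            source = decode(f.read())
        module.__file__ = self.filename
        exec(compile(source, self.filename, "exec"), module.__dict__)


class EncodedFinder(importlib.abc.MetaPathFinder):
    """Finds top-level modules stored as <name>.pye inside one directory."""

    def __init__(self, directory):
        self.directory = directory

    def find_spec(self, fullname, path=None, target=None):
        filename = os.path.join(self.directory, fullname + SUFFIX)
        if not os.path.isfile(filename):
            return None  # not ours, let the other finders try
        with open(filename, "rb") as f:
            if f.read(len(SIGNATURE)) != SIGNATURE:
                return None
        return importlib.util.spec_from_loader(fullname, EncodedLoader(filename))


def demo():
    """Write one protected module to a temp dir, import it, return a value."""
    with tempfile.TemporaryDirectory() as directory:
        with open(os.path.join(directory, "secret_mod" + SUFFIX), "wb") as f:
            f.write(encode(b"ANSWER = 6 * 7\n"))
        finder = EncodedFinder(directory)
        sys.meta_path.insert(0, finder)
        try:
            return importlib.import_module("secret_mod").ANSWER
        finally:
            sys.meta_path.remove(finder)
            sys.modules.pop("secret_mod", None)
```

This sketch only handles top-level modules in a single directory; packages and submodules would need the path argument of find_spec, which is left out for brevity.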

Of course, this is just a very simple example of what can be done. Stronger measures are needed to protect the encryption key and to prevent access to the application from a shell; otherwise any module can be inspected and disassembled as shown in the previous examples. Customizing the Python interpreter to nullify co_code is one way to prevent such inspection.

python-crypt-import-hooks is a nice (simple) example of how to implement an import hook as a C extension.

Result: it’s worth checking, but at first look it seems that more protection measures are needed.

Compiling modules with Cython

Cython is a static compiler for the Python and Cython programming languages that simplifies the job of writing Python C extensions. The key part here is that it allows us to compile Python code; the results are dynamic libraries that can be used as Python modules too.

During the import process this is the order of precedence (dates of modification are also taken into account):

shared library (.so, .pyd)
python bytecode (.pyo, .pyc)
python file (.py)
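The list above describes Python 2. On Python 3 the suffixes that the default path-based finder recognizes can be read from importlib.machinery, as in the sketch below (the loader_suffixes helper is a made-up name). My understanding is that Python 3 checks extension modules first, then source files, then sourceless bytecode, so the relative order of the last two differs from the list above; treat that as an assumption to verify against the interpreter in use.

```python
import importlib.machinery


def loader_suffixes():
    """Return the file suffixes Python 3's default finder recognizes."""
    return {
        "extension": list(importlib.machinery.EXTENSION_SUFFIXES),
        "source": list(importlib.machinery.SOURCE_SUFFIXES),
        "bytecode": list(importlib.machinery.BYTECODE_SUFFIXES),
    }


for kind, suffixes in loader_suffixes().items():
    print(kind, suffixes)
```

Either way, a compiled extension module sitting next to a .py file with the same name takes precedence, which is what makes the Cython approach work.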

The benefits of using compiled modules are many, especially for protecting the original Python code:

Binary modules make it much harder to recover the original Python code; reverse engineering techniques must be used to do so.
There is no co_code to read bytecode from, since there is no bytecode at all. It is even possible to get rid of func_code completely by using Cython directives.
The C code generated by Cython can be modified to introduce changes, improve protection, etc.
GCC optimization flags can be used while compiling the library.
Tracebacks won’t reveal code, just line numbers (and even those can be disabled with directives).
Processing speedup: Cython translates Python code to C, which is then compiled by GCC (or similar), and the compiled code runs faster than the pure Python version most of the time.

To cythonize a codebase, it’s enough to run Cython and GCC on each module. The __init__.py files must be left intact to keep package imports working.
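As a sketch of how that could be automated, the helper below collects every module except the __init__.py files and hands the list to Cython's cythonize function through a setuptools build. The function names and the project name are hypothetical, and the build step assumes that setuptools and Cython are installed and that the script is run from the project root with a relative path.

```python
import os


def modules_to_compile(root):
    """List .py files under root, leaving every __init__.py untouched."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".py") and filename != "__init__.py":
                selected.append(os.path.join(dirpath, filename))
    return sorted(selected)


def build_in_place(root):
    """Compile the selected modules to shared libraries next to the sources."""
    # Imported lazily so the helper above works without Cython installed.
    from setuptools import setup
    from Cython.Build import cythonize

    setup(
        name="protected-build",  # hypothetical project name
        ext_modules=cythonize(modules_to_compile(root)),
        script_args=["build_ext", "--inplace"],
    )
```

After a successful build the original .py files (other than __init__.py) could be removed from the distribution, leaving only the shared libraries.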

Take for instance this simple code (in a hello.py file):

def hello(name):
    print('Hello {0}'.format(name))

To compile it, run:

$ cython hello.py -o hello.c
$ gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o hello.so hello.c

Cython will spot issues such as missing imports or undefined functions during the convert-to-C stage, but runtime errors might appear too, for example when the code uses the dis module.

I found an incompatibility with keyword arguments especially annoying. Following the example above, once compiled to a .so, the function cannot be called using keyword arguments:

>>> from hello import hello
>>> hello('world')
Hello world
>>> hello(name='world')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: hello() takes no keyword arguments
>>> hello(**{'name': 'world'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: hello() takes no keyword arguments
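One possible workaround, assuming the Cython version in use supports the always_allow_keywords compiler directive, is to enable it in a header comment at the top of the module. I have not verified this against the exact Cython release used for this article, so treat it as a sketch rather than a confirmed fix.

```python
# cython: always_allow_keywords=True
# The header comment above is read by Cython at compile time;
# the regular Python interpreter simply ignores it.


def hello(name):
    print('Hello {0}'.format(name))
```

If the directive is honored, hello(name='world') should behave the same in the compiled module as it does in plain Python.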

So, depending on the complexity of the codebase, more or fewer issues will appear. Some will be simple to solve, others trickier, and some will mean sacrificing code expressiveness in exchange for a good degree of protection.

Result: seems viable enough to properly protect the python code behind an application.

Final thoughts

This article covers just a few attempts to find a solution to the problem. Some do the job and others require more work, but in the end Cython remains the most promising option. It’s true that any user will have access to binaries that can be used to reverse engineer the application, but that is going to take a good amount of time and work.

It’s also possible to combine the different approaches to provide an even more secure environment.

Next time we will look at the Python interpreter itself and introduce changes so that the code it generates depends on those changes. We will also play with Python Abstract Syntax Trees as an option to scramble code.