Protecting a Python codebase
The very nature of Python makes the task of protecting the source code complicated. Because Python is an interpreted language, the source code must be available in some form in order to execute it.
In this article, I'll write down the steps I followed while trying to find a suitable solution to the problem of protecting a Python-based codebase.
The explored techniques are implemented in the companion repository with a few more attempts that will be discussed in a second article. The objective in that project was to protect a simple Flask application with NumPy as a dependency.
Note: I've left obfuscation out of the equation on purpose. It's a well-known mechanism, and Python code obfuscation is an extensive topic in and of itself that we might cover in another post.
Distributing bytecode
Taking baby steps, the first thing to attempt is to distribute byte-compiled modules, the usual .pyc files created by the Python interpreter for performance reasons. The bytecode doesn't run any faster, but its load time is shorter.
The idea is to compile every module and distribute the resulting files instead of the traditional .py sources. Python comes with the compileall module, which processes all the .py files in a directory tree, and the invocation is quite simple.
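From a shell, the usual form is `python -m compileall .`. The same can be done programmatically, as in the minimal sketch below. It assumes Python 3, builds a throwaway directory purely for illustration, and passes `legacy=True` so the .pyc lands next to the source instead of inside `__pycache__`.

```python
import compileall
import os
import tempfile

SOURCE = "def hello_world(name):\n    print('Hello {0}'.format(name))\n"

with tempfile.TemporaryDirectory() as tree:
    # Create a tiny source tree to compile (illustrative only).
    with open(os.path.join(tree, "hello.py"), "w") as fh:
        fh.write(SOURCE)

    # Byte-compile every .py file found under the directory.
    ok = compileall.compile_dir(tree, quiet=1, legacy=True)
    produced = sorted(os.listdir(tree))

print(bool(ok), produced)
```

After this step the .py files could be left out of the distributed package, leaving only the byte-compiled modules.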
A .pyc file is a simple binary file containing:
- A magic number (four bytes)
- A timestamp (four bytes)
- A code object (marshalled code)
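The layout listed above matches the Python 2 format. As a rough sketch of how the header can be read on a newer interpreter, the snippet below assumes CPython 3.7 or later, where the header grew to 16 bytes: the magic number, a flags field, and eight more bytes holding either the timestamp and source size or a source hash. The helper name `read_pyc_header` is hypothetical.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile

SOURCE = "def hello_world(name):\n    print('Hello {0}'.format(name))\n"


def read_pyc_header(pyc_path):
    """Return (magic, flags, rest) from a CPython 3.7+ .pyc header."""
    with open(pyc_path, "rb") as fh:
        magic = fh.read(4)
        (flags,) = struct.unpack("<I", fh.read(4))
        rest = fh.read(8)  # mtime + size, or a hash when flags != 0
    return magic, flags, rest


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.py")
    with open(src, "w") as fh:
        fh.write(SOURCE)
    pyc = py_compile.compile(src, cfile=src + "c")
    magic, flags, rest = read_pyc_header(pyc)

# The magic number identifies the interpreter version that wrote the file.
print(magic == importlib.util.MAGIC_NUMBER, flags)
```

Everything after the header is the marshalled code object.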
It's fair to think that this is a secure mechanism: it's a binary file after all, and inexperienced eyes will just see garbage when it's opened with a text editor. Sadly, it's not impossible to transform it back to the original code.
Take for instance this really simple example stored in a hello.py module:
def hello_world(name):
    print('Hello {0}'.format(name))
The byte-compiled module doesn't show much to the naked eye, but it's easy to inspect using the included batteries, such as the dis module.
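One way to do that inspection is sketched below. It is not the original snippet from the article: it assumes CPython 3.7 or later (a 16-byte header before the marshalled code object), and `load_code_from_pyc` is a hypothetical helper name.

```python
import dis
import marshal
import os
import py_compile
import tempfile

SOURCE = "def hello_world(name):\n    print('Hello {0}'.format(name))\n"


def load_code_from_pyc(pyc_path, header_size=16):
    """Skip the header and unmarshal the module's code object."""
    with open(pyc_path, "rb") as fh:
        fh.seek(header_size)
        return marshal.loads(fh.read())


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.py")
    with open(src, "w") as fh:
        fh.write(SOURCE)
    pyc = py_compile.compile(src, cfile=src + "c")
    code = load_code_from_pyc(pyc)

# Names and string constants survive compilation untouched.
func_code = [c for c in code.co_consts if hasattr(c, "co_code")][0]
print(code.co_names, func_code.co_consts)

# Human-readable listing of the op-codes, nested functions included.
dis.dis(code)
```

The function name, the format string and the call structure are all visible in the output, which is what makes decompilers practical.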
It’s not hard to get a notion of what the code is about, and tools can be written that will translate the op-codes back to human-friendly strings. In fact, that’s already done by the uncompyle2 package.
Result: distribution of .pyc files is not a valid mechanism of protection.
Optimized bytecode
Python has the concept of optimized bytecode. Still taking baby steps, we could distribute optimized code, which might help to protect the original codebase.
Python optimizations aren't real code optimizations, though, but mere simplifications. There are two levels of "optimizations":
- Level one (-O flag): this level will discard assert statements from the code.
- Level two (-OO flag): this level will apply the optimizations from level one, plus it will discard docstrings from the code.
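The effect of both levels can be observed without leaving the interpreter, since the built-in compile() accepts an optimize argument that mirrors the -O and -OO flags. The following is a small sketch assuming Python 3; the function `check` and the helper `build` are made up for the demonstration.

```python
SOURCE = '''
def check(x):
    """Docstring that level two drops."""
    assert x > 0, "x must be positive"
    return x
'''


def build(optimize):
    """Compile SOURCE at the given optimization level and return check()."""
    namespace = {}
    exec(compile(SOURCE, "<demo>", "exec", optimize=optimize), namespace)
    return namespace["check"]


plain, level1, level2 = build(0), build(1), build(2)

try:
    plain(-1)
except AssertionError as exc:
    print("level 0 keeps the assert:", exc)

print(level1(-1))       # -1, the assert was discarded
print(level1.__doc__)   # docstring still present at level one
print(level2.__doc__)   # None, the docstring was discarded
```

Nothing else changes: the names, constants and control flow remain as readable as before.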
So we are back to square one, this time with just a smaller version of the .pyc files.
Result: distribution of .pyo files is not a valid mechanism of protection.
Extending the import process
Python supports customization of the import process by means of import hooks. The interpreter itself makes use of this feature to implement the import of modules from zip archives.
This concept can be used to import modules from different container types or to pre-process modules before loading them. Take for instance the following finder and loader implementation:
# Python 2 code. SIGNATURE, SECRET and AESCypher are defined elsewhere
# (see the note right after this listing).
import os
import imp
import sys


class BaseLoader(object):
    def modinfo(self, name, path):
        # Locate the module; for dotted names, retry with the package
        # part translated into a relative directory.
        try:
            modinfo = imp.find_module(name.rsplit('.', 1)[-1], path)
        except ImportError:
            if '.' not in name:
                raise
            clean_path, clean_name = name.rsplit('.', 1)
            clean_path = [clean_path.replace('.', '/')]
            modinfo = imp.find_module(clean_name, clean_path)

        file, pathname, (suffix, mode, type_) = modinfo
        if type_ == imp.PY_SOURCE:
            filename = pathname
        elif type_ == imp.PY_COMPILED:
            filename = pathname[:-1]  # 'module.pyc' -> 'module.py'
        elif type_ == imp.PKG_DIRECTORY:
            filename = os.path.join(pathname, '__init__.py')
        else:
            # Built-ins, C extensions, etc. are not handled by this hook.
            return (None, None)
        return (filename, modinfo)


class Loader(BaseLoader):
    def __init__(self, name, path):
        self.name = name
        self.path = path

    def load_module(self, name):
        # Reuse the module if it has been imported already.
        if name in sys.modules:
            return sys.modules[name]

        filename, modinfo = self.modinfo(self.name, self.path)
        if filename is None and modinfo is None:
            return None

        # Read the file, drop the signature and decrypt the remainder.
        file = modinfo[0] or open(filename, 'r')
        try:
            src = file.read()
        finally:
            file.close()
        src = self.decrypt(src[len(SIGNATURE):])

        # Build the module by hand and run the decrypted source in it.
        module = imp.new_module(name)
        module.__file__ = filename
        module.__path__ = [os.path.dirname(os.path.abspath(file.name))]
        module.__loader__ = self
        sys.modules[name] = module
        exec(src, module.__dict__)
        print("encrypted module loaded: {0}".format(name))
        return module

    def decrypt(self, input):
        return AESCypher(SECRET).decrypt(input)


class Finder(BaseLoader):
    def find_module(self, name, path=None):
        filename, modinfo = self.modinfo(name, path)
        if filename is None and modinfo is None:
            return None

        # Only claim the module when the file starts with the signature.
        file = modinfo[0] or open(filename, 'r')
        try:
            encrypted = file.read(len(SIGNATURE)) == SIGNATURE
        finally:
            file.close()
        if encrypted:
            print("encrypted module found: {0} (at {1})".format(name, path))
            return Loader(name, path)
        return None


# Register the finder so the interpreter consults it on every import.
sys.meta_path.append(Finder())
Note: AESCypher is defined here.
The loader above is a simple import hook in sys.meta_path: it intercepts imports of encrypted modules, decrypts them and returns them.
The process is quite simple. Python will call find_module, which in our case checks whether the given path and name correspond to an encrypted module. If that's the case, an instance of Loader is returned, which takes care of reading, decrypting, creating and returning the module instance.
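For readers on Python 3, where the imp module is gone, a comparable hook might look like the sketch below. It is an illustration under several assumptions rather than the code from the companion repository: it uses importlib's find_spec/exec_module protocol, it only handles top-level modules, and it swaps AES for base64 encoding so the example stays self-contained (base64 offers no protection whatsoever). The names `EncodedFinder`, `EncodedLoader`, `secret_mod` and the `SIGNATURE` value are all hypothetical.

```python
import base64
import importlib.abc
import importlib.util
import os
import sys
import tempfile

SIGNATURE = b"ENC1:"  # hypothetical marker at the start of encoded files


def encode(source):
    """Stand-in for encryption: signature + base64 payload."""
    return SIGNATURE + base64.b64encode(source.encode("utf-8"))


def decode(payload):
    """Stand-in for decryption: strip the signature and decode."""
    return base64.b64decode(payload[len(SIGNATURE):]).decode("utf-8")


class EncodedLoader(importlib.abc.Loader):
    def __init__(self, filename):
        self.filename = filename

    def create_module(self, spec):
        return None  # let the import system create a default module

    def exec_module(self, module):
        with open(self.filename, "rb") as fh:
            source = decode(fh.read())
        exec(compile(source, self.filename, "exec"), module.__dict__)


class EncodedFinder(importlib.abc.MetaPathFinder):
    def __init__(self, directory):
        self.directory = directory

    def find_spec(self, fullname, path=None, target=None):
        filename = os.path.join(self.directory, fullname + ".py")
        if not os.path.isfile(filename):
            return None
        # Only claim files that start with the signature.
        with open(filename, "rb") as fh:
            if fh.read(len(SIGNATURE)) != SIGNATURE:
                return None
        return importlib.util.spec_from_loader(
            fullname, EncodedLoader(filename), origin=filename)


# Demo: write an encoded module, register the finder, import it.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "secret_mod.py"), "wb") as fh:
    fh.write(encode("def hello(name):\n    return 'Hello {0}'.format(name)\n"))

sys.meta_path.insert(0, EncodedFinder(demo_dir))
import secret_mod

print(secret_mod.hello("world"))  # Hello world
```

The finder is inserted at the front of sys.meta_path so it gets the first chance to claim a module; anything it doesn't recognise falls through to the regular import machinery.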
Of course, this is just a very simple example of what can be done. Better measures must be taken to protect the encryption key and to prevent access to the application from a shell, otherwise any module can be inspected and disassembled as shown in the previous examples. Customizing the Python interpreter to nullify co_code is an option to avoid such code inspection.
python-crypt-import-hooks is a nice (simple) example of how to implement an import hook as a C extension.
Result: it’s worth checking, but at first look it seems that more protection measures are needed.
Compiling modules with Cython
Cython is a static compiler for the Python and Cython programming languages that simplifies the job of writing Python C extensions. The key part here is that it allows us to compile Python code; the results are dynamic libraries which can be used as Python modules too.
During the import process this is the order of precedence (dates of modification are also taken into account):
- shared library (.so, .pyd)
- python bytecode (.pyo, .pyc)
- python file (.py)
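The exact suffixes depend on the interpreter and platform. As a quick check, and assuming Python 3, the importlib.machinery module exposes the suffixes the running interpreter recognises for each kind of module. The listing below only shows the suffixes, it doesn't prove the precedence, and on Python 3.5 and later .pyo no longer appears because optimized bytecode is stored in .pyc files with an opt- tag.

```python
import importlib.machinery as machinery

# Grouped in the same order as the list above.
suffix_groups = [
    ("shared library", machinery.EXTENSION_SUFFIXES),
    ("python bytecode", machinery.BYTECODE_SUFFIXES),
    ("python file", machinery.SOURCE_SUFFIXES),
]

for label, suffixes in suffix_groups:
    print("{0}: {1}".format(label, ", ".join(suffixes)))
```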
The benefits of using compiled modules are many, especially for protecting the original Python code:
- Binary modules will impose a much harder task on those trying to get the original Python code, since reverse engineering techniques must be used to do so.
- There's no co_code to get bytecode from, since there's no bytecode at all. In fact it's even possible to get rid of func_code completely by using Cython directives.
- Cython generated C code can be modified to introduce changes, improve protection, etc.
- GCC optimization flags can be used while compiling the library.
- Tracebacks won't reveal code, but just line numbers (unless it's disabled by directives).
- Processing speedup: Cython takes Python code and translates it to C, which is then compiled by GCC (or similar), so the compiled code will run faster than the pure Python version (most of the time).
To cythonize a codebase, it's enough to just run Cython + GCC on each module. The __init__.py files must be left intact to keep module import working.
Take for instance this simple code (in a hello.py file):
def hello(name):
    print('Hello {0}'.format(name))
Compiling it takes just a Cython run followed by a GCC run.
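The exact commands used originally aren't reproduced here, so the following is only a sketch of what the two steps might look like, driven from Python. It assumes a Linux-like system with the cython and gcc executables on the PATH, and the helper names `build_commands` and `cythonize_module` are hypothetical. Many projects use a setup.py with Cython.Build.cythonize instead, which hides these details.

```python
import shutil
import subprocess
import sysconfig


def build_commands(module="hello"):
    """Return the Cython and GCC command lines for one module."""
    include_dir = sysconfig.get_paths()["include"]
    ext_suffix = sysconfig.get_config_var("EXT_SUFFIX") or ".so"
    c_file = module + ".c"
    return [
        # Step 1: translate the Python module into C.
        ["cython", module + ".py", "-o", c_file],
        # Step 2: compile the C file into a shared library.
        ["gcc", "-shared", "-fPIC", "-O2", "-I" + include_dir,
         c_file, "-o", module + ext_suffix],
    ]


def cythonize_module(module="hello"):
    """Run both steps when the tools are available; return success."""
    if not (shutil.which("cython") and shutil.which("gcc")):
        return False
    for command in build_commands(module):
        subprocess.check_call(command)
    return True


for command in build_commands("hello"):
    print(" ".join(command))
```

Once the shared library sits next to (or replaces) hello.py, a plain `import hello` picks it up.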
Cython will spot issues like missing imports or undefined functions during the convert-to-C stage, but runtime errors might appear too, for instance in code that relies on the dis module.
I found an incompatibility with keyword arguments especially annoying. Following the example above, once compiled to a .so, the function cannot be called using keyword arguments.
So, depending on the complexity of the codebase, more or fewer issues will appear. Some will be simple to solve, others trickier, and some will mean sacrificing code expressiveness in exchange for a good degree of protection.
Result: seems viable enough to properly protect the python code behind an application.
Final thoughts
This article just covers a few attempts to find a solution to the problem. Some do the job and others require more work, but in the end Cython remains the most promising option. It's true that any user will have access to binaries that can be used to reverse engineer the application, but that's going to take a good amount of time and work.
It’s also possible to combine the different approaches to provide an even more secure environment.
Next time we will look at the Python interpreter itself and introduce changes so that it generates code that depends on those changes. We will also play with Python Abstract Syntax Trees as an option to scramble code.