2 files changed, 113 insertions, 0 deletions
diff --git a/docs/develop/index.rst b/docs/develop/index.rst
index 64dbc4661..fff3e43d7 100644
--- a/docs/develop/index.rst
+++ b/docs/develop/index.rst
@@ -10,3 +10,4 @@ See the `getting started guide
    :maxdepth: 1
 
    cmodules.rst
+   qstr.rst
diff --git a/docs/develop/qstr.rst b/docs/develop/qstr.rst
new file mode 100644
index 000000000..1b3b9f903
--- /dev/null
+++ b/docs/develop/qstr.rst
@@ -0,0 +1,112 @@
+MicroPython string interning
+============================
+
+MicroPython uses `string interning`_ to save both RAM and ROM.  This avoids
+having to store duplicate copies of the same string.  Primarily, this applies to
+identifiers in your code, as something like a function or variable name is very
+likely to appear in multiple places in the code.  In MicroPython an interned
+string is called a QSTR (uniQue STRing).
+
+A QSTR value (with type ``qstr``) is an index into a linked list of QSTR pools.
+QSTRs store their length and a hash of their contents for fast comparison during
+the de-duplication process.  All bytecode operations that work with strings use
+a QSTR argument.
+
+Compile-time QSTR generation
+----------------------------
+
+In the MicroPython C code, any strings that should be interned in the final
+firmware are written as ``MP_QSTR_Foo``.  At compile time this will evaluate to
+a ``qstr`` value that points to the index of ``"Foo"`` in the QSTR pool.
+
+A multi-step process in the ``Makefile`` makes this work.  In summary this
+process has three parts:
+
+1. Find all ``MP_QSTR_Foo`` tokens in the code.
+
+2. Generate a static QSTR pool containing all the string data (including lengths
+   and hashes).
+
+3. Replace all ``MP_QSTR_Foo`` (via the preprocessor) with their corresponding
+   index.
+
+``MP_QSTR_Foo`` tokens are searched for in two sources:
+
+1. All files referenced in ``$(SRC_QSTR)``.  This is all C code (i.e. ``py``,
+   ``extmod``, ``ports/stm32``) but not including third-party code such as
+   ``lib``.
+
+2. Additional ``$(QSTR_GLOBAL_DEPENDENCIES)`` (which includes ``mpconfig*.h``).
+
+*Note:* ``frozen_mpy.c`` (generated by ``mpy-tool.py``) has its own QSTR
+generation and pool.
+
+Some additional strings that can't be expressed using the ``MP_QSTR_Foo`` syntax
+(e.g. they contain non-alphanumeric characters) are explicitly provided in
+``qstrdefs.h`` and ``qstrdefsport.h`` via the ``$(QSTR_DEFS)`` variable.
+
+Processing happens in the following stages:
+
+1. ``qstr.i.last`` is the concatenated output of running every input file
+   through the C preprocessor.  Conditionally disabled code is removed and
+   macros are expanded, so strings that won't be used in the final firmware
+   are not added to the pool.  Because ``MP_QSTR_Foo`` has no definition at
+   this stage (thanks to the ``NO_QSTR`` macro added by
+   ``QSTR_GEN_EXTRA_CFLAGS``), these tokens pass through unaffected.  The file
+   also contains preprocessor comments with line number information.  This
+   step only processes files that have changed, so ``qstr.i.last`` only
+   contains data from files that have changed since the last compile.
+
+2. ``qstr.split`` is an empty file created after running ``makeqstrdefs.py split``
+   on ``qstr.i.last``.  It is only used as a dependency to indicate that the step
+   ran.  The script outputs one file per input C file,
+   ``genhdr/qstr/...file.c.qstr``, containing only the matched QSTRs, each
+   printed as ``Q(Foo)``.  This step is necessary to combine the existing files
+   with the new data generated from the incremental update in ``qstr.i.last``.
+
+3. ``qstrdefs.collected.h`` is the output of concatenating ``genhdr/qstr/*``
+   using ``makeqstrdefs.py cat``.  It holds the full set of ``MP_QSTR_Foo``
+   tokens found in the code, formatted as ``Q(Foo)``, one per line, with
+   duplicates.  This file is only updated if the set of QSTRs has changed.  A
+   hash of the QSTR data is written to ``qstrdefs.collected.h.hash`` so that
+   changes can be tracked across builds.
+
+4. ``qstrdefs.preprocessed.h`` adds in the QSTRs from ``qstrdefs*.h``.  It
+   concatenates ``qstrdefs.collected.h`` with ``qstrdefs*.h``, then transforms
+   each line from ``Q(Foo)`` to ``"Q(Foo)"`` so they pass through the preprocessor
+   unchanged.  The preprocessor is then used to deal with any conditional
+   compilation in ``qstrdefs*.h``.  Finally, the transformation is undone back
+   to ``Q(Foo)`` and the result is saved as ``qstrdefs.preprocessed.h``.
+
+5. ``qstrdefs.generated.h`` is the output of ``makeqstrdata.py``.  For each
+   ``Q(Foo)`` in ``qstrdefs.preprocessed.h`` (plus some extra hard-coded ones),
+   it outputs ``QDEF(MP_QSTR_Foo, (const byte*)"hash" "Foo")``.
+
+Then in the main compile, two things happen with ``qstrdefs.generated.h``:
+
+1. In ``qstr.h``, each QDEF becomes an entry in an enum, which makes ``MP_QSTR_Foo``
+   available to code and equal to the index of that string in the QSTR table.
+
+2. In ``qstr.c``, the actual QSTR data table is generated as the elements of
+   ``mp_qstr_const_pool->qstrs``.
+
+.. _`string interning`: https://en.wikipedia.org/wiki/String_interning
+
+Run-time QSTR generation
+------------------------
+
+Additional QSTR pools can be created at runtime so that strings can be added to
+them. For example, the code::
+
+  foo[x] = 3
+
+will need to create a QSTR for the value of ``x`` so it can be used by the
+"load attr" bytecode.
+
+Also, when compiling Python code, identifiers and literals need to have QSTRs
+created.  Note: only literals shorter than 10 characters become QSTRs.  This is
+because a regular string on the heap always takes up a minimum of 16 bytes (one
+GC block), whereas QSTRs allow them to be packed more efficiently into the pool.
+
+QSTR pools (and the underlying "chunks" that store the string data) are allocated
+on-demand on the heap with a minimum size.
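
Note on the run-time section: the "linked list of QSTR pools" with on-demand
allocation can be illustrated with a small C sketch. This is not MicroPython's
actual ``qstr.c``. Every name below (``sketch_qstr_t``, ``sketch_pool_t``,
``sketch_intern``, ``sketch_str``, ``SKETCH_POOL_MIN_ALLOC``) is hypothetical,
and the 16-bit hash function and the pool-doubling policy are assumptions made
only for illustration. It shows a lookup that compares hash and length before
comparing bytes, and a new pool being chained on when the newest one is full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch; names and policies are NOT taken from MicroPython. */

#define SKETCH_POOL_MIN_ALLOC 4 /* assumed minimum number of entries per pool */

typedef size_t sketch_qstr_t; /* index counted across the whole chain of pools */

typedef struct {
    uint16_t hash; /* stored so most mismatches are rejected cheaply */
    size_t len;    /* stored length, compared before the bytes */
    char *data;    /* heap copy of the string */
} sketch_entry_t;

typedef struct sketch_pool {
    struct sketch_pool *prev; /* older pool, or NULL for the first one */
    size_t total_prev_len;    /* number of entries in all older pools */
    size_t alloc;             /* capacity of entries[] */
    size_t len;               /* entries currently in use */
    sketch_entry_t entries[];
} sketch_pool_t;

static sketch_pool_t *last_pool = NULL; /* newest pool in the chain */

/* Assumed hash: djb2-style, truncated to 16 bits. Any stable hash would do. */
static uint16_t sketch_hash(const char *s, size_t len) {
    uint32_t h = 5381;
    for (size_t i = 0; i < len; i++) {
        h = (h * 33) ^ (uint8_t)s[i];
    }
    return (uint16_t)h;
}

/* Search newest to oldest pool; returns 1 and sets *out if s is interned. */
static int sketch_find(const char *s, size_t len, sketch_qstr_t *out) {
    uint16_t h = sketch_hash(s, len);
    for (sketch_pool_t *p = last_pool; p != NULL; p = p->prev) {
        for (size_t i = 0; i < p->len; i++) {
            const sketch_entry_t *e = &p->entries[i];
            if (e->hash == h && e->len == len && memcmp(e->data, s, len) == 0) {
                *out = p->total_prev_len + i;
                return 1;
            }
        }
    }
    return 0;
}

/* Return the existing index for s, or add s and return its new index. */
sketch_qstr_t sketch_intern(const char *s) {
    size_t len = strlen(s);
    sketch_qstr_t q;
    if (sketch_find(s, len, &q)) {
        return q;
    }
    if (last_pool == NULL || last_pool->len == last_pool->alloc) {
        /* Newest pool is full (or absent): chain a new one in front of it. */
        size_t alloc = last_pool ? last_pool->alloc * 2 : SKETCH_POOL_MIN_ALLOC;
        sketch_pool_t *p = malloc(sizeof(*p) + alloc * sizeof(sketch_entry_t));
        assert(p != NULL);
        p->prev = last_pool;
        p->total_prev_len = last_pool ? last_pool->total_prev_len + last_pool->len : 0;
        p->alloc = alloc;
        p->len = 0;
        last_pool = p;
    }
    char *copy = malloc(len + 1);
    assert(copy != NULL);
    memcpy(copy, s, len + 1); /* includes the terminating NUL */
    sketch_entry_t *e = &last_pool->entries[last_pool->len];
    e->hash = sketch_hash(s, len);
    e->len = len;
    e->data = copy;
    q = last_pool->total_prev_len + last_pool->len;
    last_pool->len++;
    return q;
}

/* Map an index back to its string, or NULL if the index is out of range. */
const char *sketch_str(sketch_qstr_t q) {
    for (sketch_pool_t *p = last_pool; p != NULL; p = p->prev) {
        if (q >= p->total_prev_len) {
            size_t i = q - p->total_prev_len;
            return i < p->len ? p->entries[i].data : NULL;
        }
    }
    return NULL;
}
```

In this sketch, interning the same string twice returns the same index, and an
index stays valid after newer pools are added because older pools are never
moved. A real implementation would also need a policy for where the string
bytes themselves live (the "chunks" mentioned above), which the sketch
sidesteps by using one ``malloc`` per string.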