Robin-JE-Bradfield's homepage

Bookmark this to keep an eye on my project updates!

Non-Complementary-Sequence-Generator

This project is a Python program intended to generate sequences with minimal complementarity to a provided set of input sequences, for use cases such as designing linker sequences in oligonucleotides that should not bind strongly to other sequences present in the system. I wrote it primarily as practice with Python and for my own personal use, but have made it available here in case others also find it useful. Other tools for this task also exist, such as NUPACK’s Design functionality. See the repository for code, instructions, and example inputs.

Microsoft Copilot was used throughout to assist with writing, but was not used to generate any of the final code.

Principles

The first important concept for how the tool operates is the complementarity array. To avoid having to scan the entirety of the input sequences every time a generated nucleotide must be scored, the program converts them into a complementarity array: an array with k axes, each of dimension n, where n is the number of nucleotides being checked against and k is the comp_threshold. Each position in this array holds the maximum complementarity that the corresponding tuple of k nucleotides has to any k-mer in the input sequences. This makes the tool’s running time independent of the length of the input sequences (ignoring the initial time required to generate the complementarity array), at the cost of rapid increases in the time and memory required when either n or k grows - for example, when non-canonical nucleotides are used (increasing n), or when longer generated sequences require a larger comp_threshold to achieve good results (increasing k). The tool also maintains local complementarity arrays for each candidate in the beam (see below), which allows generated sequences to avoid complementarity to themselves but exacerbates the dependence on k and n.
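As a minimal sketch of this idea (using a plain Watson-Crick match fraction as a stand-in for the vector-based scoring described next, so the scoring rule and all names here are illustrative only):

```python
import itertools
import numpy as np

ALPHABET = "ACGT"
WC_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def build_comp_array(input_seqs, k):
    """Entry [i1, ..., ik] holds the best score of the k-tuple
    (ALPHABET[i1], ..., ALPHABET[ik]) against any k-mer in the inputs."""
    n = len(ALPHABET)
    comp_array = np.zeros((n,) * k)  # n**k entries: memory grows fast in k
    # Collect every k-mer occurring in the input sequences (done once).
    kmers = {seq[i:i + k] for seq in input_seqs
             for i in range(len(seq) - k + 1)}
    for idx in itertools.product(range(n), repeat=k):
        tup = "".join(ALPHABET[i] for i in idx)
        # Antiparallel pairing: compare the reverse complement of the
        # tuple against each k-mer, position by position.
        rc = "".join(WC_COMPLEMENT[b] for b in reversed(tup))
        comp_array[idx] = max(
            (sum(a == b for a, b in zip(rc, kmer)) / k for kmer in kmers),
            default=0.0,
        )
    return comp_array
```

Once the array is built, scoring a candidate k-tuple is a single array lookup, whatever the length of the inputs.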

The second important concept is vector-based scoring. To account for non-canonical nucleotides such as inosine, which may pair with varying strength to different nucleotides, and for G:U wobble base pairing in RNA, scoring by the number of exactly-matching nucleotides is not sufficient. Instead, the tool uses a comp_dictionary, which links each pair of nucleotides that should be considered to have any complementarity to a score between 0 and 1. The complementarity score for a pair of tuples is determined by iterating through each base pair while accumulating two vectors, a primary vector and a direction vector, initialized as a zero vector and a unit vector respectively. At each base pair, the complementarity of that base pair is multiplied by the direction vector and the result added to the primary vector; then, the direction vector is rotated by an angle $\theta$ into a new dimension, where $\cos(\theta) = 0.9 \cdot \text{value} + 0.1$ and value is the complementarity of the base pair. Thus, if the base pair has a complementarity of 1, $\theta = 0^\circ$ and no rotation occurs; if the base pair has a complementarity of 0, $\theta$ is close to $90^\circ$ and the direction vector is mostly rotated into a new dimension, retaining a little of its previous components to permit accounting for non-contiguous complementarity; intermediate values produce intermediate behaviours. (Note, however, that I have not done much testing of the quality of results the tool provides when values other than 0 and 1 are used - exercise caution when using this feature in particular.)
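A sketch of that accumulation loop, assuming the final score is the length of the primary vector (the choice of norm, the wobble value, and the handling of strand orientation are assumptions here, not the repository’s authoritative implementation):

```python
import numpy as np

# Illustrative pairing scores; the wobble value of 0.5 is a guess, and a
# real comp_dictionary would also cover non-canonical nucleotides.
COMP_DICT = {("A", "T"): 1.0, ("T", "A"): 1.0,
             ("C", "G"): 1.0, ("G", "C"): 1.0,
             ("G", "T"): 0.5, ("T", "G"): 0.5}

def vector_score(tuple_a, tuple_b):
    """Score two aligned k-tuples (orientation handling is left to the
    caller) by accumulating a primary vector along a direction vector
    that rotates into a fresh dimension after weakly-pairing positions."""
    k = len(tuple_a)
    primary = np.zeros(k + 1)
    direction = np.zeros(k + 1)
    direction[0] = 1.0  # start as a unit vector
    for step, (a, b) in enumerate(zip(tuple_a, tuple_b)):
        value = COMP_DICT.get((a, b), 0.0)
        primary += value * direction
        # Rotate by theta into an unused dimension, where
        # cos(theta) = 0.9 * value + 0.1, so a fully complementary
        # base pair leaves the direction vector unchanged.
        cos_t = 0.9 * value + 0.1
        sin_t = np.sqrt(1.0 - cos_t ** 2)
        new_axis = np.zeros(k + 1)
        new_axis[step + 1] = 1.0  # orthogonal to all dimensions used so far
        direction = cos_t * direction + sin_t * new_axis
    return float(np.linalg.norm(primary))  # assumed: score = vector length
```

One consequence of this construction is that contiguous complementarity adds nearly collinear contributions and so yields a longer primary vector than the same total complementarity scattered across the tuple.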

The third important concept is beam search, a widely used search method, particularly in machine learning. Beam search operates by maintaining a fixed number of candidates, known as the beam; at each step, every possible expansion of each candidate in the beam is generated and scored, and then the beam_size candidates with the highest scores - regardless of which candidate they derived from - are retained for the next round. This avoids getting stuck in dead ends while keeping the computational load manageable. (Note that greedy search and exhaustive search are both special cases of beam search, with a beam size of $1$ or $\infty$ respectively.)
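A minimal, generic sketch of the procedure (the scoring function here is a placeholder, not the tool’s combined complementarity/diversity score):

```python
import heapq

def beam_search_generate(length, beam_size, alphabet, score_fn):
    """Grow sequences one nucleotide at a time, keeping only the
    beam_size best-scoring candidates after each step."""
    beam = [""]  # start from the empty sequence
    for _ in range(length):
        # Every possible one-nucleotide expansion of every candidate.
        candidates = [seq + base for seq in beam for base in alphabet]
        # Keep the top beam_size overall, regardless of parent candidate.
        beam = heapq.nlargest(beam_size, candidates, key=score_fn)
    return max(beam, key=score_fn)

# Toy usage with a placeholder score that penalizes G content.
best = beam_search_generate(10, beam_size=8, alphabet="ACGT",
                            score_fn=lambda s: -s.count("G"))
```

Setting beam_size to 1 here recovers greedy search, and letting it exceed the number of possible candidates recovers exhaustive search.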

The Sequence Framework

The tool uses an internal class called SequenceFramework, which stores a framework of sequences and gaps. The internal data storage of that class is a nested list in the following format:
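Roughly, and with the caveat that the exact layout below is an illustration reconstructed from the description that follows rather than the authoritative format (see the repository for the real specification), a framework might look like this, with a seq_strand mixing fixed sequences and integer gap lengths plus a property_dict of per-strand options:

```python
framework = [                  # top level: the Framework
    [                          # each sublist: one strand
        ["ACGT", 5, "TTAGC"],  # seq_strand: fixed sequences and gap lengths (illustrative)
        {"name": "strand_1"},  # property_dict: per-strand options (illustrative)
    ],
]
```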

The starting framework can be submitted in this format. However, for ease of use, the class also accepts some other formats. The input format must always have a top-level list (Framework) and sublists (strands), but strands do not need to include a seq_strand and property_dict; instead, the property_dict can, if desired, be included as one of the items in the strand at the same level as the sequence and gap information.

Depending on the value of framework_input_format, strands can encode sequence and gap information in two ways.

Variance Scoring

In addition to scoring by complementarity, the tool also scores by sequence diversity, encouraging generated sequences to retain an even balance of different nucleotides when this does not conflict with complementarity. There is currently no provision for turning this feature off, and I have not tested whether it is actually necessary; I originally added it to discourage a behaviour of an early version of the tool, which would produce highly repetitive sequences (e.g. “ACCACCACCACCACCACCACCACC…”), but that was before I implemented the ability to account for self-complementarity and before I implemented the vector scoring system, either of which might naturally resolve most cases of this issue.
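The diversity metric itself isn’t detailed above, but as one plausible sketch (normalized Shannon entropy of the nucleotide composition, a stand-in for the tool’s actual metric), a score like the following would reward balance and penalize the repetitive output described:

```python
import math
from collections import Counter

def diversity_score(seq, alphabet="ACGT"):
    """Hypothetical diversity score: normalized Shannon entropy of the
    nucleotide composition. 1.0 means a perfectly even balance; repetitive
    sequences score lower. The tool's real metric may differ."""
    counts = Counter(seq)
    total = len(seq)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(alphabet))

diversity_score("ACCACCACCACC")  # ~0.46: only A and C, and unevenly balanced
diversity_score("ACGTACGTACGT")  # 1.0: all four bases equally represented
```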