What is PhysBinder?
Transcription factor binding sites (TFBSs) are DNA sequences of 6–15 base pairs. The interaction of TFBSs with transcription factors (TFs) is largely responsible for spatiotemporal gene expression patterns. PhysBinder is an online tool for predicting TFBSs, built on a flexible and extensible algorithm. The algorithm uses both direct readout features (the sequence itself) and indirect readout features (biophysical properties such as the bendability of the DNA) of protein-DNA complexes, and it significantly outperforms current state-of-the-art approaches for in silico TFBS identification. Users can submit sequences for analysis by the PhysBinder integrative algorithm and choose from more than 60 different TF binding models. The results are shown in an intuitive visualization and can help steer future wet-lab experiments.
How does it work?
Models are built from two classes of nucleotide sequences that the method must learn to separate. One class contains positive sequences known to be bound in vivo; the other contains negative or background sequences that are highly unlikely to be bound in vivo. Each nucleotide sequence, from either class, is converted into multiple series of values. For the structural model, each series gives the values of one specific DNA structural characteristic at all positions of the TFBS and its context; for the NPD model, each series simply consists of the one-base or two-base parts of the sequence. A first, basic selection of relevant features (i.e. positions) is made by statistically comparing the distributions of values for positive and negative sequences, using mild thresholds. Further selection is performed through wrapper-based feature selection, i.e. by evaluating cross-validation performance with the Random Forest algorithm. Per characteristic, redundant features are removed by sequential backward elimination (SBE). Several single-characteristic models may then be merged, and the final NPD model and final structural model can be merged into one integrative model. To score a new nucleotide sequence, the sequence is first converted into the series of features contained in the model; the Random Forest classifier then uses the model to predict the likelihood that the sequence is a TFBS.
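The first two steps described above (converting sequences into one-base and two-base features, then a basic filter comparing positive and negative sequences) can be illustrated with a minimal Python sketch. This is not the PhysBinder implementation: the function names, the feature names (`mono_i`, `di_i`), and the `min_diff` threshold are hypothetical, and a plain frequency difference stands in for the statistical comparison, whose exact form is not specified here. It assumes all sequences have the same length.

```python
from collections import Counter


def encode_sequence(seq):
    """Convert a DNA sequence into position-specific one-base and two-base features.

    Feature names (mono_i, di_i) are hypothetical, chosen for this illustration.
    """
    seq = seq.upper()
    features = {}
    for i, base in enumerate(seq):
        features[f"mono_{i}"] = base          # single base at position i
    for i in range(len(seq) - 1):
        features[f"di_{i}"] = seq[i:i + 2]    # two adjacent bases starting at i
    return features


def value_frequencies(encoded, name):
    """Relative frequency of each value of one feature across encoded sequences."""
    counts = Counter(row[name] for row in encoded)
    total = len(encoded)
    return {value: n / total for value, n in counts.items()}


def filter_features(positives, negatives, min_diff=0.5):
    """Keep features whose value frequencies differ between the two classes.

    Assumption: a simple frequency difference replaces the statistical test,
    and min_diff plays the role of a mild threshold. Sequences must all have
    the same length.
    """
    pos = [encode_sequence(s) for s in positives]
    neg = [encode_sequence(s) for s in negatives]
    kept = []
    for name in pos[0]:
        p = value_frequencies(pos, name)
        q = value_frequencies(neg, name)
        # Largest frequency gap over all values seen in either class.
        diff = max(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))
        if diff >= min_diff:
            kept.append(name)
    return kept


if __name__ == "__main__":
    positives = ["ACGT", "ACGA", "ACTT"]   # toy sequences, not real binding sites
    negatives = ["TTGT", "GAGA", "CCTT"]
    print(filter_features(positives, negatives))
```

In this toy example only the positions where the positive sequences share a base (or base pair) that the negatives lack pass the filter. The features that survive such a filter would then go on to the wrapper-based selection with Random Forest, which this sketch does not cover.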