#######################
Fragmentation Processes
#######################

..  epigraph::

    The fragmentation process runs in Kubernetes and is able to process
    vendor molecule data into a form that can be used in our graph database
    and fragalysis.

************
Architecture
************

Fragmentation is implemented and described in our `fragmentor`_ repository.

Converting vendor data into a form suitable for a graph database requires
two container images (a *player* and a *fragmentor*). These images are
built from the `fragmentor`_ repository's code. The process also
relies on a PostgreSQL database.

The process is divided into a number of "plays" that *standardise* the vendor
data, *fragment* it (into a database) and then *extract* the fragments to form
a graph.

The whole process is managed from a workstation using the playbook provided
in the `fragmentor-ansible`_ repository.

The actual "plays" are orchestrated by the *player* container launched from
the user's workstation. It launches a number of *fragmentor* containers in
order to distribute the workload amongst a number of parallel processes that
share a common volume for data processing.

AWS S3 is both the source of the vendor data and the destination for the
final results, as illustrated in the following diagram: -

..  image:: ../images/im-kubernetes-fragmentor/im-kubernetes-fragmentor.001.png

..  note::
    The role of this section of the documentation is not to explain the actual
    process but to explain how to execute it within a Kubernetes cluster.

************
Installation
************

Our `fragmentor-ansible`_ repository contains a detailed description of
the installation and execution of the fragmentor, and describes what you'll
need to do to process a small set of molecules (up to a few thousand).
Refer to it for up-to-date instructions, but you will need: -

1.  A Kubernetes **Namespace**
2.  A **ReadWriteMany** storage class (like NFS or EFS)
3.  Nodes with our **Node Labels** for the fragmentation Pods
4.  A **PostgreSQL database** (and user)
5.  An AWS **Bucket**
6.  Plenty of spare cores and memory

Follow the **Kubernetes namespace setup** section of the repo's README
for a fuller description of how to set up the cluster. The section provides
an example set of parameters that you can use.
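
As a rough illustration of the first two prerequisites, the following
Kubernetes manifest sketches a **Namespace** and a claim that you could use to
check that your storage class supports **ReadWriteMany**. The names
``fragmentor``, ``rwx-check`` and ``nfs``, and the ``10Gi`` size, are
assumptions made for this example. They are not values taken from the
`fragmentor-ansible`_ repository, whose playbook may well create its own
volumes for you::

    ---
    apiVersion: v1
    kind: Namespace
    metadata:
      # Hypothetical namespace name
      name: fragmentor
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      # Hypothetical claim, used only to test the storage class
      name: rwx-check
      namespace: fragmentor
    spec:
      accessModes:
      - ReadWriteMany
      # Hypothetical storage class, substitute your NFS or EFS class
      storageClassName: nfs
      resources:
        requests:
          # Illustrative size only
          storage: 10Gi

If the claim binds, the storage class can provide the shared volume that the
parallel *fragmentor* containers rely on. You can delete the test claim
afterwards.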

Support for private-registry images
===================================

If you need to pull the database image from a private registry that requires
an **ImagePullSecret**, you can provide the details as
additional parameters::

    pg_image_registry: my-registry
    pg_image_name: postgres
    pg_image_version: '12.2'
    all_image_preset_pullsecret_name: my-secret
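
The parameter name suggests that the pull secret is expected to exist in the
**Namespace** already, although that is an assumption you should check against
the repository README. A minimal sketch of such a secret, using the standard
Kubernetes ``kubernetes.io/dockerconfigjson`` type, might look like the
following. The ``fragmentor`` namespace is a hypothetical name and the data
value is a placeholder for your own base64-encoded Docker ``config.json``::

    apiVersion: v1
    kind: Secret
    metadata:
      # Matches all_image_preset_pullsecret_name above
      name: my-secret
      # Hypothetical namespace name
      namespace: fragmentor
    type: kubernetes.io/dockerconfigjson
    data:
      # Placeholder, replace with your own encoded registry credentials
      .dockerconfigjson: <base64-encoded Docker config.json>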

..  warning::
    If you expect to process a large number of molecules you'll need to
    consult with us to understand what preparation is required before you
    embark on any fragmentation, as the basic setup described here will not
    be suitable.

******************
Running the player
******************

With the cluster set up (and the database installed in the **Namespace**) you
should be able to run the fragmentor plays [#f1]_. This is
documented in the **Running a play** section of the `fragmentor-ansible`_
repository's README.

This essentially requires the one-time preparation of the database
(handled by the player) and then running the fragmentation process,
which entails the use of the following plays: -

-   **standardise**
-   **fragment**
-   **inchi**
-   **extract**

Refer to the `fragmentor`_ repository for further details of the
S3 and data requirements and a description of the basic process.

Support for private-registry images
===================================

If you need to pull the player and fragmentor images from a private
registry that requires an **ImagePullSecret**, you can provide the details
as additional parameters in the player's parameter file::

    # Details of the 'player' image
    fp_image_registry: my-registry
    fp_image_name: informaticsmatters/fragmentor-player
    fp_image_tag: '1.0.0'
    # Details of the 'fragmentor' image
    nextflow_container_registry: my-registry
    nextflow_container_name: informaticsmatters/fragmentor
    nextflow_container_tag: sdf-05
    # The common pull-secret
    all_image_preset_pullsecret_name: my-secret


.. rubric:: Footnotes

.. [#f1] You will need molecule data in a supported format stored in your
         bucket.

.. _fragmentor: https://github.com/InformaticsMatters/fragmentor
.. _fragmentor-ansible: https://github.com/InformaticsMatters/fragmentor-ansible