Graph and Stack data (AWS S3)

The Fragment Graph database must be loaded with fragment data, and the Fragalysis Stack with target/media data, before either can be used.

This data typically resides on an NFS server or an S3 object store.

The graph database pulls its data down as it initialises, but the stack can only be loaded once it is running. The stack is often re-loaded with more data as needs dictate.

Note

You need to have access to this data before you can deploy the graph or the stack.

To access AWS S3 data you must provide the graph and the stack loader with AWS credentials that have suitable permissions. Your policy needs s3:Get* and s3:List* permissions for the buckets and paths you intend to use.

As an example, users of our buckets are given the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*"
            ],
            "Resource": "arn:aws:s3:::im-fragnet/combination/3/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*"
            ],
            "Resource": "arn:aws:s3:::im-fragalysis/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:List*"
            ],
            "Resource": "arn:aws:s3:::im-fragnet"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:List*"
            ],
            "Resource": "arn:aws:s3:::im-fragalysis"
        }
    ]
}
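If you need an equivalent policy for your own buckets, the structure above can be generated rather than hand-edited. The following is a minimal sketch using only the Python standard library; the function name build_read_policy and its arguments are illustrative assumptions, not part of any Fragalysis or AWS tooling.

```python
import json


def build_read_policy(get_resources, list_buckets):
    """Return a policy dict granting s3:Get* on object paths and s3:List* on buckets.

    get_resources: resource suffixes such as "im-fragnet/combination/3/*"
    list_buckets:  bucket names such as "im-fragnet"
    (Both argument names are hypothetical, chosen for this sketch.)
    """
    statements = []
    # One statement per object path that the loader needs to read.
    for resource in get_resources:
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:Get*"],
            "Resource": f"arn:aws:s3:::{resource}",
        })
    # One statement per bucket that the loader needs to list.
    for bucket in list_buckets:
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:List*"],
            "Resource": f"arn:aws:s3:::{bucket}",
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    policy = build_read_policy(
        ["im-fragnet/combination/3/*", "im-fragalysis/*"],
        ["im-fragnet", "im-fragalysis"],
    )
    print(json.dumps(policy, indent=4))
```

Run as a script, this prints a policy with the same four statements as the example above. Substitute your own bucket names and paths as required.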

Graph data

Graph data consists of Node and Relationship definitions, typically stored in a series of compressed CSV files.

On S3 this data is stored as objects on a path. As an example, you might have a fragalysis-graph bucket and the path combination/3. The bucket and path names are not important, but the Graph’s initialisation (which includes a load phase) simply copies all the objects on the path (not recursively) and expects them to represent a viable graph.

You must have the following files/objects on the path: -

  • load-neo4j.sh
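Before deploying, it can be useful to check which objects the non-recursive copy would pick up and whether the required file is among them. The sketch below works on a plain list of object keys (however you obtain them) and uses only the Python standard library. The helper names and the example keys other than load-neo4j.sh are hypothetical.

```python
# Objects that must be present on the graph path (from the list above).
REQUIRED_OBJECTS = {"load-neo4j.sh"}


def objects_on_path(keys, path):
    """Return the names of objects directly on `path`, ignoring nested ones."""
    prefix = path.strip("/") + "/"
    names = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        # Skip the bare prefix itself and anything further down the tree,
        # mirroring a non-recursive copy.
        if name and "/" not in name:
            names.append(name)
    return names


def missing_required(keys, path):
    """Return the required object names that are absent from `path`."""
    return sorted(REQUIRED_OBJECTS - set(objects_on_path(keys, path)))


if __name__ == "__main__":
    # Hypothetical bucket listing; only load-neo4j.sh is named in this document.
    example_keys = [
        "combination/3/load-neo4j.sh",
        "combination/3/nodes.csv.gz",
        "combination/3/extra/readme.txt",
    ]
    print(objects_on_path(example_keys, "combination/3"))
    print(missing_required(example_keys, "combination/3"))
```

An empty result from missing_required means the required file is present. It does not verify that the remaining objects form a viable graph.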

Stack data

Stack data consists of Target definitions, typically stored in a bucket under a data origin directory. The files are specific to the Fragalysis application and their format is not covered here.

Stack data is stored in bucket directories, normally named using the format <YYYY>-<MM>-<DD>T<HH>, although other directories may also exist. If you have target data in the directory 2020-09-15T16, then 2020-09-15T16 is your data origin.

Target data must exist in the following bucket and path: -

  • s3://<DATA_BUCKET>/jango-data/<DATA_ORIGIN>
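As an illustration of how the bucket and data origin combine, the sketch below builds the expected location and checks whether a data origin parses as the usual date-and-hour name. It uses only the Python standard library. The function names are hypothetical, and the path segment is copied as written above.

```python
from datetime import datetime


def is_conventional_origin(data_origin):
    """Return True if the name parses as <YYYY>-<MM>-<DD>T<HH>.

    Other directory names are allowed to exist, so a False result is
    informational only, not an error.
    """
    try:
        datetime.strptime(data_origin, "%Y-%m-%dT%H")
    except ValueError:
        return False
    return True


def target_data_uri(data_bucket, data_origin):
    """Return the S3 location where target data is expected to live."""
    # Path segment taken verbatim from the location given above.
    return f"s3://{data_bucket}/jango-data/{data_origin}"


if __name__ == "__main__":
    # "im-fragalysis" is the bucket from the example policy; yours may differ.
    print(is_conventional_origin("2020-09-15T16"))
    print(target_data_uri("im-fragalysis", "2020-09-15T16"))
```

Because strptime is lenient about zero-padding, treat the check as a quick sanity test, not strict validation.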