Genomics Use-case

Note 04/03/2020 - this needs updating

The genomics use-case involves multiple Faasm functions:

  • gene/mapper - the top-level entry function. This spawns a child to handle each chunk of reads data.
  • gene/mapper_index[1-n] - each of these functions handles the mapping for a different chunk of the index.

There will be as many gene/mapper_index functions as there are chunks of the index. A basic division of the human genome is into chromosomes, in which case we will have 24 mapper_index functions. Each of these functions gets called once per chunk of reads data.
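To make the fan-out concrete, the sketch below just does the arithmetic for the number of mapper_index invocations in one run. The 24 index chunks come from the chromosome split described above; the number of reads chunks (3) is a made-up value for illustration. Nothing here calls Faasm.

```shell
# Assumed values, for illustration only
index_chunks=24   # one mapper_index function per chromosome
read_chunks=3     # hypothetical number of reads chunks

# Each reads chunk triggers one call to every mapper_index function
total_calls=$((index_chunks * read_chunks))
echo "mapper_index invocations: ${total_calls}"
```

With these assumed values the total is 72, so the number of calls grows linearly with the number of reads chunks.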

Data

You can download the genomics data and then upload it to your Faasm instance with:

    inv data.genomics-download-s3

    # Use --local-copy if running locally
    inv genomics.upload-data --local-copy

The genomics data is shared via Faasm's shared files rather than directly through shared state.

WASM

To build the genomics library to WASM, then upload and invoke the functions, you can run:

    # Build
    ./bin/clean_genomics.sh
    ./bin/build_genomics.sh

    # Upload
    inv upload.genomics

    # Invoke for a single read chunk
    inv invoke gene mapper --input=1

    # Invoke in a loop for all read chunks
    inv genomics.mapping
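As a rough illustration of what invoking per read chunk looks like, the single-chunk command above can be repeated by hand. This is a sketch only: it assumes read chunks are numbered from 1, uses a hypothetical count of 3 chunks, and is not necessarily how inv genomics.mapping is implemented. It defaults to a dry run that only prints the commands; DRY_RUN is a variable made up for this sketch.

```shell
# Hypothetical number of read chunks; adjust to match your data
n_chunks=3
# Dry run by default; set DRY_RUN=0 to actually invoke the functions
DRY_RUN="${DRY_RUN:-1}"

invoked=0
i=1
while [ "${i}" -le "${n_chunks}" ]; do
    if [ "${DRY_RUN}" = "1" ]; then
        # Just show the command that would be run
        echo "inv invoke gene mapper --input=${i}"
    else
        inv invoke gene mapper --input="${i}"
    fi
    invoked=$((invoked + 1))
    i=$((i + 1))
done
```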

Native

Note: if you're building native and WASM in the same directory, be sure to clean when switching.

First you need to install libfaasm natively:

    inv libs.native

Once that's set up, you can run the following:

    ./bin/build_genomics_native.sh

The repo itself then describes how to use this code.

Data and Indexing

The index and reads only need to be set up once and uploaded to S3. To do this you need a native build of the indexer (described above). Then you can run:

    # Download the data
    inv genomics.download-reads
    inv genomics.download-genome

    # Run the indexing
    inv genomics.index-genome

    # Do the upload
    inv data.genomics-upload-s3

Mapping

To map a reads file you can do the following:

    ./bin/gem-mapper -I data/human_c_20_idx.gem -i data/reads_1.fq -o data/my_output.sam

You can change threads with -t. Adding -t 1 can be useful for debugging.
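For example, a single-threaded debugging run might look like the sketch below. It reuses the paths from the mapping command above and only adds -t 1. The variable names are just for this sketch, and the command is printed first so the exact invocation shows up in the terminal.

```shell
# Same paths as the mapping command above; -t 1 forces a single thread
threads=1
mapper=./bin/gem-mapper
args="-I data/human_c_20_idx.gem -i data/reads_1.fq -o data/my_output.sam -t ${threads}"

# Print the command before running it
echo "${mapper} ${args}"

# Only run if the native binary has been built
if [ -x "${mapper}" ]; then
    # Unquoted on purpose so the arguments are split on spaces
    ${mapper} ${args}
fi
```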

Misc

Lots of animal genomes at this FTP server.

See the readme for the file layout. More genomes can be added in the download_genome.py script.

The Ensembl page is another source: https://www.ensembl.org/Homo_sapiens/Info/Index (good for individual chromosomes).

CLion

  • Add a new native toolchain (Settings -> Build, Execution, Deployment -> Toolchains)
  • Add a new custom build target (along with a new build tool for make under the "build" field)
  • Create a new run configuration for this target
  • Have it run bin/gem-indexer with the relevant input/output files

Internals

Mapping is handled through mapper.c, which calls mapper_run. For each thread it creates a mapping_stats_t and a mapper_search_t.

Threads are either a mapper_pe_thread or a mapper_se_thread. These are just different types of mapping and also live in mapper.c. mapper_se_thread is the default.

The mapper parameters tell each thread which files it's dealing with, which thread number it is, etc.