diff --git a/data/pre-processing_examples/Example_1.fastq.gz b/data/pre-processing_examples/Example_1.fastq.gz
new file mode 100644
index 0000000000000000000000000000000000000000..523e940d016b964a6c0702d5c6539330baf47bf6
Binary files /dev/null and b/data/pre-processing_examples/Example_1.fastq.gz differ
diff --git a/data/pre-processing_examples/Example_2.fastq.gz b/data/pre-processing_examples/Example_2.fastq.gz
new file mode 100644
index 0000000000000000000000000000000000000000..369d336b213c5d3f2aa31cb97045c88231a54849
Binary files /dev/null and b/data/pre-processing_examples/Example_2.fastq.gz differ
diff --git a/data/pre-processing_examples/Example_3.fastq.gz b/data/pre-processing_examples/Example_3.fastq.gz
new file mode 100644
index 0000000000000000000000000000000000000000..6ae7215ae92c3d790e53a1948eb8c45bc3b6d45a
Binary files /dev/null and b/data/pre-processing_examples/Example_3.fastq.gz differ
diff --git a/scripts/Pre-processing.ipynb b/scripts/Pre-processing.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..4556c74ce2115a19ab680dfe7e22a4b9c4bdd9dc
--- /dev/null
+++ b/scripts/Pre-processing.ipynb
@@ -0,0 +1,882 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a64ebf77-9027-4dfe-973b-36c2c721a491",
+   "metadata": {},
+   "source": [
+    "# Pre-processing functions (`EMOTE_parse_read` and `EMOTE_demultiplex`) for 5'/3'-end RNA-Seq data\n",
+    "\n",
+    "The 3 datasets required for this notebook are provided in the [data](../data) directory of the GitLab project.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "acf99143-ad96-4fbe-be2a-d0b3c6c3ab6a",
+   "metadata": {},
+   "source": [
+    "# Load the EMOTE-tk library from the R source file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "23c68491-3667-48a6-b38c-2a6092a1874a",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "── Attaching core tidyverse packages ────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──\n",
+      "✔ dplyr     1.1.4     ✔ readr     2.1.5\n",
+      "✔ forcats   1.0.0     ✔ stringr   1.5.1\n",
+      "✔ ggplot2   3.4.4     ✔ tibble    3.2.1\n",
+      "✔ lubridate 1.9.3     ✔ tidyr     1.3.0\n",
+      "✔ purrr     1.0.2     \n",
+      "── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──\n",
+      "✖ dplyr::filter() masks stats::filter()\n",
+      "✖ dplyr::lag()    masks stats::lag()\n",
+      "ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n",
+      "Loading required package: GenomeInfoDb\n",
+      "\n",
+      "Loading required package: BiocGenerics\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘BiocGenerics’\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:lubridate’:\n",
+      "\n",
+      "    intersect, setdiff, union\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:dplyr’:\n",
+      "\n",
+      "    combine, intersect, setdiff, union\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:stats’:\n",
+      "\n",
+      "    IQR, mad, sd, var, xtabs\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:base’:\n",
+      "\n",
+      "    anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste,\n",
+      "    pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min\n",
+      "\n",
+      "\n",
+      "Loading required package: S4Vectors\n",
+      "\n",
+      "Loading required package: stats4\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘S4Vectors’\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:lubridate’:\n",
+      "\n",
+      "    second, second<-\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:dplyr’:\n",
+      "\n",
+      "    first, rename\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:tidyr’:\n",
+      "\n",
+      "    expand\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:utils’:\n",
+      "\n",
+      "    findMatches\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:base’:\n",
+      "\n",
+      "    expand.grid, I, unname\n",
+      "\n",
+      "\n",
+      "Loading required package: IRanges\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘IRanges’\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:lubridate’:\n",
+      "\n",
+      "    %within%\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:dplyr’:\n",
+      "\n",
+      "    collapse, desc, slice\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:purrr’:\n",
+      "\n",
+      "    reduce\n",
+      "\n",
+      "\n",
+      "Loading required package: GenomicRanges\n",
+      "\n",
+      "Loading required package: Biostrings\n",
+      "\n",
+      "Loading required package: XVector\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘XVector’\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:purrr’:\n",
+      "\n",
+      "    compact\n",
+      "\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘Biostrings’\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:base’:\n",
+      "\n",
+      "    strsplit\n",
+      "\n",
+      "\n",
+      "Loading required package: BiocParallel\n",
+      "\n",
+      "Loading required package: GenomicAlignments\n",
+      "\n",
+      "Loading required package: SummarizedExperiment\n",
+      "\n",
+      "Loading required package: MatrixGenerics\n",
+      "\n",
+      "Loading required package: matrixStats\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘matrixStats’\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:dplyr’:\n",
+      "\n",
+      "    count\n",
+      "\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘MatrixGenerics’\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:matrixStats’:\n",
+      "\n",
+      "    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians,\n",
+      "    colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds,\n",
+      "    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs,\n",
+      "    rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians,\n",
+      "    rowWeightedSds, rowWeightedVars\n",
+      "\n",
+      "\n",
+      "Loading required package: Biobase\n",
+      "\n",
+      "Welcome to Bioconductor\n",
+      "\n",
+      "    Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation(\"Biobase\")', and for packages 'citation(\"pkgname\")'.\n",
+      "\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘Biobase’\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:MatrixGenerics’:\n",
+      "\n",
+      "    rowMedians\n",
+      "\n",
+      "\n",
+      "The following objects are masked from ‘package:matrixStats’:\n",
+      "\n",
+      "    anyMissing, rowMedians\n",
+      "\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘GenomicAlignments’\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:dplyr’:\n",
+      "\n",
+      "    last\n",
+      "\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘ShortRead’\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:dplyr’:\n",
+      "\n",
+      "    id\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:purrr’:\n",
+      "\n",
+      "    compose\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:tibble’:\n",
+      "\n",
+      "    view\n",
+      "\n",
+      "\n",
+      "Loading required package: splines\n",
+      "\n",
+      "Loading required package: survival\n",
+      "\n",
+      "Loading required package: prodlim\n",
+      "\n",
+      "\n",
+      "Attaching package: ‘survcomp’\n",
+      "\n",
+      "\n",
+      "The following object is masked from ‘package:VGAM’:\n",
+      "\n",
+      "    fisherz\n",
+      "\n",
+      "\n",
+      "Loading required package: bsseq\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "options(width = 250)\n",
+    "options(crayon.enabled = FALSE)\n",
+    "source(\"../src/emote-tk.R\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a139229-5f41-48ba-9408-fe5417e9ab16",
+   "metadata": {},
+   "source": [
+    "# Extraction of the mappable part of reads"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7d8d915b-6d84-4099-8085-74661a0966d7",
+   "metadata": {},
+   "source": [
+    "Let's have a look at the raw reads we want to parse:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "e28fdd55-ad64-456f-80a6-155be69b6fbd",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "DNAStringSet object of length 10000:\n",
+       "        width seq\n",
+       "    [1]    33 AGGAGAAGAGCGGTTCAGCAGGAATGCCGAGAC\n",
+       "    [2]    33 AGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG\n",
+       "    [3]    33 AGGCCAGCGACGCGAAGTAGAATCAGTAATTTG\n",
+       "    [4]    33 AGGCAAGGAACGCCATGCGAGAGCGGTATTATC\n",
+       "    [5]    33 AGGGGTTAAGTTATTAAGGGCGCACGGTGGATG\n",
+       "    ...   ... ...\n",
+       " [9996]    33 AGGGCAAAAACGCGCACAAAAAATGACCAAGAA\n",
+       " [9997]    33 AGGAGGGACAGCACCGCTCTTCCGATCTTAAGC\n",
+       " [9998]    33 AGGGTCCCCCGCTTATTGATATGCAAGATGAAG\n",
+       " [9999]    33 AGGCGAGGCACGCTCAGTCAAGCTGATTTAAAT\n",
+       "[10000]    33 AGGGGGCGCGCGCTAAAAGCTACGCACGTTTTT"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "fq_streamer = FastqStreamer(\"../data/pre-processing_examples/Example_1.fastq.gz\")\n",
+    "sr <- yield(fq_streamer)\n",
+    "sr@sread"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "282ab09c-fcc3-4f58-acc4-d90dc87f9c6d",
+   "metadata": {},
+   "source": [
+    "In this example we have 10000 reads of 33 nucleotides each, which are expected to be structured as follows:\n",
+    "<pre>            \n",
+    "        1 3         4      10     11 13      14                  33\n",
+    "        AGG         - VVVVVVV -    CGC      - XXXXXXXXXXXXXXXXXXXXX\n",
+    "recognition.seq     -   UMI   - control.seq -  RNA 5'-end (readseq)\n",
+    "</pre>\n",
+    "\n",
+    "- The first 3 nucleotides correspond to a recognition sequence (**recognition.seq**), which should always be `AGG`\n",
+    "- The next 7 nucleotides are a random sequence forming a Unique Molecular Identifier (**UMI**); the only constraint is that it must not contain any T\n",
+    "- Then comes another exact sequence (**control.seq**), which should always be `CGC`\n",
+    "- Finally, we have the **5'-end of the transcript**\n",
+    "\n",
+    "To extract only the part that corresponds to the 5'-end of the transcript, we have to build an `EMOTE_features` table using the `EMOTE_read_features` function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c1bf29b6-53ca-4861-ac8c-4937203e0bd0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rf = EMOTE_read_features(start = 14, width = 19)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ccf3cd26-68b7-4f61-9a9e-61fa19504dad",
+   "metadata": {},
+   "source": [
+    "By default, only A, C, T or G are allowed in the nucleotide sequence. The number of mismatches allowed, as well as the nucleotides/characters allowed in the `readseq` sequence, can be changed by modifying some parameters as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "c2c6d370-dd19-47d0-9bdc-7e168d1d8c34",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<table class=\"dataframe\">\n",
+       "<caption>A EMOTE_features: 1 × 7</caption>\n",
+       "<thead>\n",
+       "\t<tr><th scope=col>name</th><th scope=col>start</th><th scope=col>width</th><th scope=col>pattern_type</th><th scope=col>pattern</th><th scope=col>max_mismatch</th><th scope=col>readid_prepend</th></tr>\n",
+       "\t<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;lgl&gt;</th></tr>\n",
+       "</thead>\n",
+       "<tbody>\n",
+       "\t<tr><td>readseq</td><td>14</td><td>19</td><td>1</td><td>ACG</td><td>1</td><td>FALSE</td></tr>\n",
+       "</tbody>\n",
+       "</table>\n"
+      ],
+      "text/latex": [
+       "A EMOTE\\_features: 1 × 7\n",
+       "\\begin{tabular}{lllllll}\n",
+       " name & start & width & pattern\\_type & pattern & max\\_mismatch & readid\\_prepend\\\\\n",
+       " <chr> & <dbl> & <dbl> & <dbl> & <chr> & <dbl> & <lgl>\\\\\n",
+       "\\hline\n",
+       "\t readseq & 14 & 19 & 1 & ACG & 1 & FALSE\\\\\n",
+       "\\end{tabular}\n"
+      ],
+      "text/markdown": [
+       "\n",
+       "A EMOTE_features: 1 × 7\n",
+       "\n",
+       "| name &lt;chr&gt; | start &lt;dbl&gt; | width &lt;dbl&gt; | pattern_type &lt;dbl&gt; | pattern &lt;chr&gt; | max_mismatch &lt;dbl&gt; | readid_prepend &lt;lgl&gt; |\n",
+       "|---|---|---|---|---|---|---|\n",
+       "| readseq | 14 | 19 | 1 | ACG | 1 | FALSE |\n",
+       "\n"
+      ],
+      "text/plain": [
+       "  name    start width pattern_type pattern max_mismatch readid_prepend\n",
+       "1 readseq 14    19    1            ACG     1            FALSE         "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "EMOTE_read_features(start = 14, width = 19, pattern = \"ACG\", max_mismatch = 1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "84ce999b-a51f-4dc7-b518-4468fe0fdb85",
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "ERROR",
+     "evalue": "Error in `mutate()`:\nℹ In argument: `pc_valid = is_valid/total_read`.\nCaused by error:\n! object 'is_valid' not found\n",
+     "output_type": "error",
+     "traceback": [
+      "Error in `mutate()`:\nℹ In argument: `pc_valid = is_valid/total_read`.\nCaused by error:\n! object 'is_valid' not found\nTraceback:\n",
+      "1. EMOTE_parse_read(fastq_file = \"../data/pre-processing_examples/Example_1.fastq.gz\", \n .     features = rf, force = T)",
+      "2. mutate(stat_tb, pc_valid = is_valid/total_read) %>% mutate(demux_filename = paste0(out_dir, \n .     \"/\", fq_basename, \"_\", tolower(group), \".fastq.gz\"))",
+      "3. mutate(., demux_filename = paste0(out_dir, \"/\", fq_basename, \n .     \"_\", tolower(group), \".fastq.gz\"))",
+      "4. mutate(stat_tb, pc_valid = is_valid/total_read)",
+      "5. mutate.data.frame(stat_tb, pc_valid = is_valid/total_read)",
+      "6. mutate_cols(.data, dplyr_quosures(...), by)",
+      "7. withCallingHandlers(for (i in seq_along(dots)) {\n .     poke_error_context(dots, i, mask = mask)\n .     context_poke(\"column\", old_current_column)\n .     new_columns <- mutate_col(dots[[i]], data, mask, new_columns)\n . }, error = dplyr_error_handler(dots = dots, mask = mask, bullets = mutate_bullets, \n .     error_call = error_call, error_class = \"dplyr:::mutate_error\"), \n .     warning = dplyr_warning_handler(state = warnings_state, mask = mask, \n .         error_call = error_call))",
+      "8. mutate_col(dots[[i]], data, mask, new_columns)",
+      "9. mask$eval_all_mutate(quo)",
+      "10. eval()",
+      "11. .handleSimpleError(function (cnd) \n  . {\n  .     local_error_context(dots, i = frame[[i_sym]], mask = mask)\n  .     if (inherits(cnd, \"dplyr:::internal_error\")) {\n  .         parent <- error_cnd(message = bullets(cnd))\n  .     }\n  .     else {\n  .         parent <- cnd\n  .     }\n  .     message <- c(cnd_bullet_header(action), i = if (has_active_group_context(mask)) cnd_bullet_cur_group_label())\n  .     abort(message, class = error_class, parent = parent, call = error_call)\n  . }, \"object 'is_valid' not found\", base::quote(NULL))",
+      "12. h(simpleError(msg, call))",
+      "13. abort(message, class = error_class, parent = parent, call = error_call)",
+      "14. signal_abort(cnd, .file)"
+     ]
+    }
+   ],
+   "source": [
+    "EMOTE_parse_read(\n",
+    "            fastq_file  = \"../data/pre-processing_examples/Example_1.fastq.gz\",\n",
+    "            features = rf,\n",
+    "            force = T)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9bb4d784-8ae4-49a8-b648-8972a19f1e12",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fq_streamer = FastqStreamer(\"../data/pre-processing_examples/Example_1_valid.fastq.gz\")\n",
+    "sr <- yield(fq_streamer)\n",
+    "sr@sread"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2bdc6812-0684-4767-bb95-cacfa31e252f",
+   "metadata": {},
+   "source": [
+    "We see that only the sequence from position 14 to 33 is present in the output FASTQ file."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00392ee1-4176-4ccb-8de3-5345bd6c3d98",
+   "metadata": {},
+   "source": [
+    "# Extraction of the mappable part of **valid** reads"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9374d0b8-e3ac-4732-936b-e47f7d55a7d1",
+   "metadata": {},
+   "source": [
+    "In addition to what we have done before, we can extract the 5'-ends of the transcripts only for the reads that are valid, by testing the validity of each of the expected elements described earlier. <br>\n",
+    "To do that, let's add some features to our EMOTE_features table with the **EMOTE_add_read_feature** function:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b13d7122-84f3-4ca4-a17f-0c6cb79ca173",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rf = EMOTE_add_read_feature(rf, name = \"Recognition.seq\", start = 1, width = 3, pattern = \"AGG\", pattern_type = 2)\n",
+    "rf = EMOTE_add_read_feature(rf, name = \"Control.seq\", start = 11, width = 3, pattern = \"CGC\", pattern_type = 2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a54f7041-4a64-498e-8a0b-112aa5806b68",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5aee3e6d-672c-4a4a-9ff6-6d4f64276f7b",
+   "metadata": {},
+   "source": [
+    "Following the read structure described above, we define the start, the width and the expected sequence of the 2 features **Recognition.seq** and **Control.seq**.\n",
+    "As these features must match an **exact** sequence at the given positions, we set the `pattern_type` parameter to 2. <br>\n",
+    "The `pattern_type` parameter can take 3 values:\n",
+    "- **1** if the pattern is a string of allowed characters\n",
+    "- **2** if the pattern is an exact sequence (or a vector of exact sequences)\n",
+    "- **3** if the pattern is a regular expression whose match, once found, is trimmed from the read together with everything that follows it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0a6affb9-4f5a-42eb-a014-1fdb2064d68b",
+   "metadata": {},
+   "source": [
+    "Then we add the last feature we want to check, the **UMI**:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6000958d-e2c8-437a-896a-7ddf9bcd3bf9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rf = EMOTE_add_read_feature(rf, name = \"UMI\", start = 4, width = 7, pattern = \"ACG\", pattern_type = 1, readid_prepend = T)\n",
+    "\n",
+    "rf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b7b009b5-6bed-4b19-b7f8-56bd439b7be9",
+   "metadata": {},
+   "source": [
+    "As for the other features, we set the name, start and width of the feature. <br>\n",
+    "As this feature is a random sequence with a list of allowed characters (A, C or G but not T), we set `pattern_type` to 1 and `pattern` to \"ACG\".\n",
+    "\n",
+    "We also set **readid_prepend** to TRUE so that the identified UMI is added to the read identifier (we will check this later)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ed26d4f1-1ab5-4c8d-97ce-9c1d004981dc",
+   "metadata": {},
+   "source": [
+    "Now that we have defined all the features we expect to find along the reads, we can (as in the previous example) run the **EMOTE_parse_read** function to extract the readseq sequence of the reads for which all features are valid."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3b8ccb4d-98ad-487f-b577-37a09749f5ea",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "report = EMOTE_parse_read(\n",
+    "            fastq_file  = \"../data/pre-processing_examples/Example_1.fastq.gz\",\n",
+    "            features = rf,\n",
+    "            force = T)\n",
+    "report"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "13ebb809-7fbf-42b4-95a9-a3caffd57fe5",
+   "metadata": {},
+   "source": [
+    "The **EMOTE_parse_read** function returns a parse report that provides some statistics about the validity of the checked features. <br>\n",
+    "For example, here we see that among the 10000 input reads, 4820 were fully valid while 5163 were invalid because of an invalid feature. <br>\n",
+    "We also see that most of the invalid reads have a valid **Recognition.seq** and a valid **UMI**, but very few of them have a valid **Control.seq**, indicating that an invalid Control.seq is the main reason reads are rejected."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9347cb6f-24c7-49e1-9f7d-b915d6c18efd",
+   "metadata": {},
+   "source": [
+    "Let's have a look at the output FASTQ file to check that it only contains the part of each read that corresponds to the readseq sequence:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a42ea938-09f8-49cb-946d-eaeb53de509a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fq_streamer = FastqStreamer(\"../data/pre-processing_examples/Example_1_valid.fastq.gz\")\n",
+    "sr <- yield(fq_streamer)\n",
+    "sr@sread"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c0d7a261-b1cb-4348-93ca-3fc60cceb841",
+   "metadata": {},
+   "source": [
+    "Let's also check that the UMI sequences have been correctly added to the read identifiers:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a8ae5dcd-3eef-4c81-8d09-2520d9c0b8ad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sr@id"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe4f9f6d-e1af-46a9-8bb7-f5d9aa820125",
+   "metadata": {},
+   "source": [
+    "# Extraction of the mappable part of reads according to a barcode sequence"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3265cfdf-4ae8-4605-8cbb-71c1fd56f418",
+   "metadata": {},
+   "source": [
+    "To multiplex several experimental conditions in one sequencing run, a barcode sequence can be ligated to the ends of the transcripts.\n",
+    "In this example we have 10000 reads of 24 nucleotides each, which are expected to be structured as follows:\n",
+    "<pre>            \n",
+    "        1  4        5                 24\n",
+    "        CAAG/TCGG - XXXXXXXXXXXXXXXXXXXX\n",
+    "         barcode  -   RNA 5'-end (readseq)\n",
+    "</pre>\n",
+    "\n",
+    "- The first 4 nucleotides correspond to a **barcode sequence**, which should always be either \"CAAG\" or \"TCGG\"\n",
+    "- Then we have the **5'-end of the transcript**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ccf94585-eff7-4fc6-b781-f4e461309d75",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fq_streamer = FastqStreamer(\"../data/pre-processing_examples/Example_2.fastq.gz\")\n",
+    "sr <- yield(fq_streamer)\n",
+    "sr@sread"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5bee9dd0-0ae7-45f2-a118-0e8e04f860d4",
+   "metadata": {},
+   "source": [
+    "Let's build the EMOTE_features table in order to extract the mappable part of the reads and group them into different output FASTQ files according to the barcode sequence.\n",
+    "\n",
+    "We begin with the mappable part (the 5'-end of the RNA), then add the barcode feature as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d644a577-2a85-4470-b7a2-1a16b200bd37",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rf = EMOTE_read_features(start = 5, width = 20)\n",
+    "rf = EMOTE_add_read_feature(rf, name = \"barcode\", start = 1, width = 4, pattern = c(\"TCGG\", \"CAAG\"), pattern_type = 2, readid_prepend = F)\n",
+    "\n",
+    "rf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b0fc19bc-37d4-4028-9c76-4de2e1bcb23a",
+   "metadata": {},
+   "source": [
+    "Note that it is **mandatory** to name the feature used for demultiplexing \"barcode\". <br>\n",
+    "- The start position and the width are set according to the read structure we defined. <br>\n",
+    "- The `pattern` parameter is a vector containing the 2 allowed barcode sequences (\"TCGG\" and \"CAAG\"). <br>\n",
+    "- The `pattern_type` is set to 2 because the pattern corresponds to exact sequences."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2489d2f0-c967-4881-aff3-5954ef132b76",
+   "metadata": {},
+   "source": [
+    "Now that we have the EMOTE_features table, we can extract the mappable part of the reads that have a valid barcode sequence and split them into different output FASTQ files according to the feature named \"barcode\" using the **EMOTE_demultiplex** function:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0d9359ed-ae1c-4e2d-b005-eaf269b446b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "EMOTE_demultiplex(\n",
+    "            fastq_file  = \"../data/pre-processing_examples/Example_2.fastq.gz\",\n",
+    "            features = rf\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "328486dd-8f0e-4e45-998a-db92391a0641",
+   "metadata": {},
+   "source": [
+    "**EMOTE_demultiplex** also returns a report with statistics about the validity of the features. <br>\n",
+    "These statistics are grouped by barcode.\n",
+    "\n",
+    "**Explanation of the table?**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bb194c43-4b5e-4356-a3f0-4e599b3f9a60",
+   "metadata": {},
+   "source": [
+    "Once again, if we take a look at one of the two output files, we see that we only kept the part we wanted to extract."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb6ab678-8bb7-4ef4-9168-1a6988bcc1f8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fq_streamer = FastqStreamer(\"../data/pre-processing_examples/Example_2_demux/Example_2_TCGG_valid.fastq.gz\")\n",
+    "sr <- yield(fq_streamer)\n",
+    "sr@sread"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "009b2b76-0ab0-49fc-b95d-063fe1420655",
+   "metadata": {},
+   "source": [
+    "# Removing a pattern from the mappable part of reads"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6754034a-1427-4e56-bcf0-8677822016cf",
+   "metadata": {},
+   "source": [
+    "In this example, we have 10000 reads with no specific structure. However, because of a step in the experimental protocol, reads can contain a poly A sequence that should be removed before mapping to the reference genome."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6c933c7e-0124-48d9-a4e9-11a9778da695",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fq_streamer = FastqStreamer(\"../data/pre-processing_examples/Example_3.fastq.gz\")\n",
+    "sr <- yield(fq_streamer)\n",
+    "sr@sread"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "92e35290-5bc6-4125-a3cc-4ad2cd2e2ef8",
+   "metadata": {},
+   "source": [
+    "We see that several reads contain a run of A's that cannot be mapped to a genome. <br>\n",
+    "Let's parse them with **EMOTE_parse_read** to remove these poly A sequences. First, we have to build an **EMOTE_features** table:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bd2100af-f92d-4cb6-8199-6c4b945bb434",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rf = EMOTE_read_features(start = 1, width = 50)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e1106b6d-c1df-444f-a498-47329849e753",
+   "metadata": {},
+   "source": [
+    "This time, the position of the sequence to extract is variable, since the position of the poly A is also variable. <br>\n",
+    "So we set the start position to 1 and the width to 50 (the entire read sequence), and the extraction of the mappable part is done after the removal of the poly A."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "efe14dc6-1219-4a88-8d6f-2ec1e3a07cec",
+   "metadata": {},
+   "source": [
+    "Now let's add the poly A feature to our EMOTE_features table:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "233819ea-b139-4950-a735-8800952aa769",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rf = EMOTE_add_read_feature(rf, name = \"PolyA\", start = 1, width = 50, pattern = \"AAAAAA.+\", pattern_type = 3, readid_prepend = F)\n",
+    "rf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1d4d080c-1bec-490d-b8f9-7b00891ad48f",
+   "metadata": {},
+   "source": [
+    "As the poly A sequence can be found anywhere in the reads, we set the start position to 1 and the width to 50. <br>\n",
+    "The pattern is a regular expression; here we define a poly A as a run of at least 6 A's. <br>\n",
+    "As the pattern is a regular expression to identify and remove together with everything that follows it, we set `pattern_type` to 3."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e25ca83a-36a5-4441-96c7-7c6fd9757220",
+   "metadata": {},
+   "source": [
+    "With this EMOTE_features table we can run **EMOTE_parse_read**:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3377e5b5-be74-481c-bfa5-96e5b63d608e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "report = EMOTE_parse_read(\n",
+    "            fastq_file  = \"../data/pre-processing_examples/Example_3.fastq.gz\",\n",
+    "            features = rf,\n",
+    "            force = T)\n",
+    "report"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "56cb8e86-5ff9-4175-9dc2-3b7b34b20d2c",
+   "metadata": {},
+   "source": [
+    "The report indicates that among the 10000 reads, only 8212 were valid, based on the validity of readseq. <br>\n",
+    "Note that readseq may also be invalid because the read is too short after the removal of the poly A."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3ce07100-5c95-47fe-a7e3-d71729e854b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fq_streamer = FastqStreamer(\"../data/pre-processing_examples/Example_3_valid.fastq.gz\")\n",
+    "sr <- yield(fq_streamer)\n",
+    "sr@sread"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5f6e1198-aea7-44c2-ab11-b7fa3cd88793",
+   "metadata": {},
+   "source": [
+    "As we can see, the poly A sequences are removed from the reads, producing reads of different lengths. <br>\n",
+    "The minimum allowed length can be set in the **EMOTE_parse_read** function (the default is 18)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "R [conda env:emote-tk-devel-shared-env-R4.3.2] *",
+   "language": "R",
+   "name": "conda-env-emote-tk-devel-shared-env-R4.3.2-r"
+  },
+  "language_info": {
+   "codemirror_mode": "r",
+   "file_extension": ".r",
+   "mimetype": "text/x-r-source",
+   "name": "R",
+   "pygments_lexer": "r",
+   "version": "4.3.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}