<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2016-02-01 lun. 15:13 -->
<meta  http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta  name="viewport" content="width=device-width, initial-scale=1" />
<title>Spike Sorting The Elementary Way</title>
<meta  name="generator" content="Org-mode" />
<meta  name="author" content="Christophe Pouzat" />
<style type="text/css">
 <!--/*--><![CDATA[/*><!--*/
  .title  { text-align: center;
             margin-bottom: .2em; }
  .subtitle { text-align: center;
              font-size: medium;
              font-weight: bold;
              margin-top:0; }
  .todo   { font-family: monospace; color: red; }
  .done   { font-family: monospace; color: green; }
  .priority { font-family: monospace; color: orange; }
  .tag    { background-color: #eee; font-family: monospace;
            padding: 2px; font-size: 80%; font-weight: normal; }
  .timestamp { color: #bebebe; }
  .timestamp-kwd { color: #5f9ea0; }
  .org-right  { margin-left: auto; margin-right: 0px;  text-align: right; }
  .org-left   { margin-left: 0px;  margin-right: auto; text-align: left; }
  .org-center { margin-left: auto; margin-right: auto; text-align: center; }
  .underline { text-decoration: underline; }
  #postamble p, #preamble p { font-size: 90%; margin: .2em; }
  p.verse { margin-left: 3%; }
  pre {
    border: 1px solid #ccc;
    box-shadow: 3px 3px 3px #eee;
    padding: 8pt;
    font-family: monospace;
    overflow: auto;
    margin: 1.2em;
  }
  pre.src {
    position: relative;
    overflow: visible;
    padding-top: 1.2em;
  }
  pre.src:before {
    display: none;
    position: absolute;
    background-color: white;
    top: -10px;
    right: 10px;
    padding: 3px;
    border: 1px solid black;
  }
  pre.src:hover:before { display: inline;}
  pre.src-sh:before    { content: 'sh'; }
  pre.src-bash:before  { content: 'sh'; }
  pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
  pre.src-R:before     { content: 'R'; }
  pre.src-perl:before  { content: 'Perl'; }
  pre.src-java:before  { content: 'Java'; }
  pre.src-sql:before   { content: 'SQL'; }

  table { border-collapse:collapse; }
  caption.t-above { caption-side: top; }
  caption.t-bottom { caption-side: bottom; }
  td, th { vertical-align:top;  }
  th.org-right  { text-align: center;  }
  th.org-left   { text-align: center;   }
  th.org-center { text-align: center; }
  td.org-right  { text-align: right;  }
  td.org-left   { text-align: left;   }
  td.org-center { text-align: center; }
  dt { font-weight: bold; }
  .footpara { display: inline; }
  .footdef  { margin-bottom: 1em; }
  .figure { padding: 1em; }
  .figure p { text-align: center; }
  .inlinetask {
    padding: 10px;
    border: 2px solid gray;
    margin: 10px;
    background: #ffffcc;
  }
  #org-div-home-and-up
   { text-align: right; font-size: 70%; white-space: nowrap; }
  textarea { overflow-x: auto; }
  .linenr { font-size: smaller }
  .code-highlighted { background-color: #ffff00; }
  .org-info-js_info-navigation { border-style: none; }
  #org-info-js_console-label
    { font-size: 10px; font-weight: bold; white-space: nowrap; }
  .org-info-js_search-highlight
    { background-color: #ffff00; color: #000000; font-weight: bold; }
  /*]]>*/-->
</style>
<script type="text/javascript">
/*
@licstart  The following is the entire license notice for the
JavaScript code in this tag.

Copyright (C) 2012-2013 Free Software Foundation, Inc.

The JavaScript code in this tag is free software: you can
redistribute it and/or modify it under the terms of the GNU
General Public License (GNU GPL) as published by the Free Software
Foundation, either version 3 of the License, or (at your option)
any later version.  The code is distributed WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.  See the GNU GPL for more details.

As additional permission under GNU GPL version 3 section 7, you
may distribute non-source (e.g., minimized or compacted) forms of
that code without the copy of the GNU GPL normally required by
section 4, provided you include this license notice and a URL
through which recipients can access the Corresponding Source.


@licend  The above is the entire license notice
for the JavaScript code in this tag.
*/
<!--/*--><![CDATA[/*><!--*/
 function CodeHighlightOn(elem, id)
 {
   var target = document.getElementById(id);
   if(null != target) {
     elem.cacheClassElem = elem.className;
     elem.cacheClassTarget = target.className;
     target.className = "code-highlighted";
     elem.className   = "code-highlighted";
   }
 }
 function CodeHighlightOff(elem, id)
 {
   var target = document.getElementById(id);
   if(elem.cacheClassElem)
     elem.className = elem.cacheClassElem;
   if(elem.cacheClassTarget)
     target.className = elem.cacheClassTarget;
 }
/*]]>*///-->
</script>
<script type="text/x-mathjax-config">
    MathJax.Hub.Config({
        displayAlign: "center",
        displayIndent: "0em",

        "HTML-CSS": { scale: 100,
                        linebreaks: { automatic: "false" },
                        webFont: "TeX"
                       },
        SVG: {scale: 100,
              linebreaks: { automatic: "false" },
              font: "TeX"},
        NativeMML: {scale: 100},
        TeX: { equationNumbers: {autoNumber: "AMS"},
               MultLineWidth: "85%",
               TagSide: "right",
               TagIndent: ".8em"
             }
});
</script>
<script type="text/javascript"
        src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
</head>
<body>
<div id="content">
<h1 class="title">Spike Sorting The Elementary Way</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#orgheadline1">1. Downloading the data</a></li>
<li><a href="#orgheadline3">2. Importing the required modules and loading the data</a></li>
<li><a href="#orgheadline7">3. Preliminary analysis</a>
<ul>
<li><a href="#orgheadline4">3.1. Five number summary</a></li>
<li><a href="#orgheadline5">3.2. Were the data normalized?</a></li>
<li><a href="#orgheadline6">3.3. Discretization step amplitude</a></li>
</ul>
</li>
<li><a href="#orgheadline8">4. Plot the data</a></li>
<li><a href="#orgheadline10">5. Data renormalization</a>
<ul>
<li><a href="#orgheadline9">5.1. A quick check that the <code>MAD</code> "does its job"</a></li>
</ul>
</li>
<li><a href="#orgheadline13">6. Detect peaks</a>
<ul>
<li><a href="#orgheadline11">6.1. Interactive spike detection check</a></li>
<li><a href="#orgheadline12">6.2. Split the data set in two parts</a></li>
</ul>
</li>
<li><a href="#orgheadline17">7. Cuts</a>
<ul>
<li><a href="#orgheadline14">7.1. Events</a></li>
<li><a href="#orgheadline15">7.2. Noise</a></li>
<li><a href="#orgheadline16">7.3. Getting "clean" events</a></li>
</ul>
</li>
<li><a href="#orgheadline22">8. Dimension reduction</a>
<ul>
<li><a href="#orgheadline18">8.1. Principal Component Analysis (PCA)</a></li>
<li><a href="#orgheadline19">8.2. Exploring <code>PCA</code> results</a></li>
<li><a href="#orgheadline20">8.3. Static representation of the projected data</a></li>
<li><a href="#orgheadline21">8.4. Dynamic visualization of the data with <code>GGobi</code></a></li>
</ul>
</li>
<li><a href="#orgheadline25">9. Clustering with K-Means</a>
<ul>
<li><a href="#orgheadline23">9.1. Cluster specific plots</a></li>
<li><a href="#orgheadline24">9.2. Results inspection with <code>GGobi</code></a></li>
</ul>
</li>
<li><a href="#orgheadline32">10. Spike "peeling": a "Brute force" superposition resolution</a>
<ul>
<li><a href="#orgheadline26">10.1. First peeling</a></li>
<li><a href="#orgheadline27">10.2. Second peeling</a></li>
<li><a href="#orgheadline28">10.3. Third peeling</a></li>
<li><a href="#orgheadline29">10.4. Fourth peeling</a></li>
<li><a href="#orgheadline30">10.5. Fifth peeling</a></li>
<li><a href="#orgheadline31">10.6. General comparison</a></li>
</ul>
</li>
<li><a href="#orgheadline33">11. Getting the spike trains</a></li>
<li><a href="#orgheadline2">12. Individual function definitions</a>
<ul>
<li><a href="#orgheadline34">12.1. <code>plot_data_list</code></a></li>
<li><a href="#orgheadline35">12.2. <code>peak</code></a></li>
<li><a href="#orgheadline36">12.3. <code>cut_sgl_evt</code></a></li>
<li><a href="#orgheadline37">12.4. <code>mk_events</code></a></li>
<li><a href="#orgheadline38">12.5. <code>plot_events</code></a></li>
<li><a href="#orgheadline39">12.6. <code>plot_data_list_and_detection</code></a></li>
<li><a href="#orgheadline40">12.7. <code>mk_noise</code></a></li>
<li><a href="#orgheadline41">12.8. <code>mad</code></a></li>
<li><a href="#orgheadline44">12.9. <code>mk_aligned_events</code></a>
<ul>
<li><a href="#orgheadline42">12.9.1. The jitter: A worked out example</a></li>
<li><a href="#orgheadline43">12.9.2. Function definition</a></li>
</ul>
</li>
<li><a href="#orgheadline45">12.10. <code>mk_center_dictionary</code></a></li>
<li><a href="#orgheadline46">12.11. <code>classify_and_align_evt</code></a></li>
<li><a href="#orgheadline47">12.12. <code>predict_data</code></a></li>
</ul>
</li>
</ul>
</div>
</div>
<div id="outline-container-orgheadline1" class="outline-2">
<h2 id="orgheadline1"><span class="section-number-2">1</span> Downloading the data</h2>
<div class="outline-text-2" id="text-1">
<p>
The data can be downloaded with the commands below (watch out: the import line is slightly different if you're using <code>Python 2</code>):
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock1">from urllib.request import urlretrieve # Python 3
# from urllib import urlretrieve # Python 2
data_names = ['Locust_' + str(i) + '.dat.gz' for i in range(1,5)]
data_src = ['http://xtof.disque.math.cnrs.fr/data/' + n
            for n in data_names]
[urlretrieve(data_src[i],data_names[i]) for i in range(4)]
</pre>
</div>
<p>
They were stored as floats coded on 64 bits and compressed with <code>gzip</code>, so we decompress them:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock2">import gzip
import shutil
data_snames = ['Locust_' + str(i) + '.dat' for i in range(1,5)]
for in_name, out_name in zip(data_names,data_snames):
    with gzip.open(in_name,'rb') as f_in:
        with open(out_name,'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
</pre>
</div>
<p>
20 seconds of data sampled at 15 kHz are contained in these files (see <a href="http://xtof.perso.math.cnrs.fr/pdf/Pouzat+:2002.pdf">PouzatEtAl_2002</a> for details). Four
files corresponding to the four electrodes or recording sites of a
<i>tetrode</i> (see Sec. why-tetrode) are used.
</p>
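<p>
As a hedged sanity check on the storage format, the sketch below computes the number of samples and bytes expected per file if each recording lasts exactly 20 s at 15 kHz (an assumption taken from the figures quoted above), and runs the same gzip and <code>np.fromfile</code> round trip on a synthetic trace. The file names <code>fake.dat.gz</code> and <code>fake.dat</code> are hypothetical and are not part of the data set.
</p>

```python
import gzip
import os
import shutil
import tempfile

import numpy as np

sampling_rate_hz = 15000                 # 15 kHz, as stated in the text
duration_s = 20                          # 20 s of data, as stated in the text
expected_samples = sampling_rate_hz * duration_s
expected_bytes = expected_samples * 8    # 8 bytes per 64-bit float

# Round trip on synthetic data (hypothetical file names, not the real files)
rng = np.random.default_rng(0)
fake_trace = rng.standard_normal(1000)
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_name = os.path.join(tmp_dir, 'fake.dat.gz')
    raw_name = os.path.join(tmp_dir, 'fake.dat')
    # Store the trace as compressed raw 64-bit floats
    with gzip.open(gz_name, 'wb') as f_out:
        f_out.write(fake_trace.tobytes())
    # Decompress exactly as done above for the Locust files
    with gzip.open(gz_name, 'rb') as f_in, open(raw_name, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    raw_size = os.path.getsize(raw_name)
    recovered = np.fromfile(raw_name, np.double)
```

<p>
If the decompressed <code>Locust_*.dat</code> files have a size different from <code>expected_bytes</code>, either the duration or the number format differs from what is assumed here.
</p>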
</div>
</div>

<div id="outline-container-orgheadline3" class="outline-2">
<h2 id="orgheadline3"><span class="section-number-2">2</span> Importing the required modules and loading the data</h2>
<div class="outline-text-2" id="text-2">
<p>
The individual functions developed for this kind of analysis are defined at the end of this document (Sec. <a href="#orgheadline2">12</a>).
They can also be downloaded as a single file, <a href="https://raw.githubusercontent.com/christophe-pouzat/LASCON2016/master/code/sorting_with_python.py">sorting_with_python.py</a>, which must then be imported with, for instance:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock3">import sorting_with_python as swp
</pre>
</div>
<p>
where it is assumed that the working directory of your <code>python</code> session is the directory where the file <code>sorting_with_python.py</code> can be found.
We are going to use <code>numpy</code> and <code>pylab</code> (we will also use <code>pandas</code> later on, but only to generate one figure, so you can do the analysis without it). We are also going to use the interactive mode of <code>pylab</code>:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock4">import numpy as np
import matplotlib.pylab as plt
plt.ion()
</pre>
</div>

<p>
<code>Python 3</code> was used to perform this analysis but everything also works with <code>Python 2</code>. We load the data with:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock5"># Create a list with the file names
data_files_names = ['Locust_' + str(i) + '.dat' for i in range(1,5)]
# Get the length of the data in the files
data_len = np.unique(list(map(len, map(lambda n:
                                       np.fromfile(n,np.double),
                                       data_files_names))))[0]
# Load the data in a list of numpy arrays
data = [np.fromfile(n,np.double) for n in data_files_names]
</pre>
</div>
</div>
</div>

<div id="outline-container-orgheadline7" class="outline-2">
<h2 id="orgheadline7"><span class="section-number-2">3</span> Preliminary analysis</h2>
<div class="outline-text-2" id="text-3">
<p>
We are going to start our analysis with some "sanity checks" to make sure that nothing "weird" happened during the recording.
</p>
</div>
<div id="outline-container-orgheadline4" class="outline-3">
<h3 id="orgheadline4"><span class="section-number-3">3.1</span> Five number summary</h3>
<div class="outline-text-3" id="text-3-1">
<p>
We should start by getting an overall picture of the data, like the <a href="http://en.wikipedia.org/wiki/Five-number_summary">five-number summary</a> that the <code>mquantiles</code> function of module <code>scipy.stats.mstats</code> can output. The five numbers are the <code>minimum</code>, the <code>first quartile</code>, the <code>median</code>, the <code>third quartile</code> and the <code>maximum</code>:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock6">from scipy.stats.mstats import mquantiles
np.set_printoptions(precision=3)
[mquantiles(x,prob=[0,0.25,0.5,0.75,1]) for x in data]
</pre>
</div>

<pre class="example">
[array([ -9.074,  -0.371,  -0.029,   0.326,  10.626]),
 array([ -8.229,  -0.45 ,  -0.036,   0.396,  11.742]),
 array([-6.89 , -0.53 , -0.042,  0.469,  9.849]),
 array([ -7.347,  -0.492,  -0.04 ,   0.431,  10.564])]
</pre>


<p>
In the above result, each row corresponds to a recording channel, the first column contains the minimal value; the second, the first quartile; the third, the median; the fourth, the third quartile; the fifth, the maximal value.
We see that the data range (<code>maximum - minimum</code>) is similar (roughly 17 to 20) on the four recording sites. The inter-quartile ranges are also similar.
</p>
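<p>
To make the two quantities compared above concrete, here is a minimal sketch that computes a five-number summary, the range and the inter-quartile range on a toy array. It uses <code>np.percentile</code> instead of <code>mquantiles</code>; the two functions interpolate between order statistics slightly differently, so values on real data may differ in the last digits. The helper name <code>five_number_summary</code> is hypothetical.
</p>

```python
import numpy as np

def five_number_summary(x):
    """Return minimum, first quartile, median, third quartile and maximum."""
    return np.percentile(x, [0, 25, 50, 75, 100])

toy = np.arange(1.0, 10.0)              # 1, 2, ..., 9
summary = five_number_summary(toy)
data_range = summary[4] - summary[0]    # maximum - minimum
iqr = summary[3] - summary[1]           # third quartile - first quartile
```

<p>
On the toy array the quartiles fall exactly on data points, which makes the result easy to check by hand.
</p>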
</div>
</div>

<div id="outline-container-orgheadline5" class="outline-3">
<h3 id="orgheadline5"><span class="section-number-3">3.2</span> Were the data normalized?</h3>
<div class="outline-text-3" id="text-3-2">
<p>
We can check next if some processing like a division by the <i>standard deviation</i> (SD) has been applied:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock7">[np.std(x) for x in data]
</pre>
</div>

<pre class="example">
[0.99999833333194166,
 0.99999833333193622,
 0.99999833333194788,
 0.99999833333174282]
</pre>


<p>
We see that SD normalization was indeed applied to these data…
</p>
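<p>
The values above are slightly below 1. One plausible explanation, stated here as an assumption rather than a documented fact, is that the traces were divided by the sample SD (denominator <code>n - 1</code>) while <code>np.std</code> uses the denominator <code>n</code> by default. With <code>n = 300000</code> samples (20 s at 15 kHz) the ratio of the two is the square root of <code>(n - 1) / n</code>, about 0.9999983. The sketch below checks this on synthetic data.
</p>

```python
import numpy as np

n = 300000                                   # assumed sample count: 20 s at 15 kHz
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.5, size=n)   # arbitrary synthetic trace

x_norm = x / np.std(x, ddof=1)               # normalize with the sample SD (n - 1)
sd_population = np.std(x_norm)               # np.std default: denominator n
predicted = np.sqrt((n - 1) / n)             # expected ratio of the two estimates
```

<p>
Here <code>sd_population</code> and <code>predicted</code> agree to within rounding error, and both are close to the values printed above.
</p>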
</div>
</div>

<div id="outline-container-orgheadline6" class="outline-3">
<h3 id="orgheadline6"><span class="section-number-3">3.3</span> Discretization step amplitude</h3>
<div class="outline-text-3" id="text-3-3">
<p>
We can easily obtain the size of the digitization step:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock8">[np.min(np.diff(np.sort(np.unique(x)))) for x in data]
</pre>
</div>

<pre class="example">
[0.0067098450784115471,
 0.0091945001879327748,
 0.011888432902217971,
 0.0096140421286605715]
</pre>
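<p>
The idea behind this computation can be illustrated on a synthetic signal whose digitization step is known: the smallest difference between distinct sorted values recovers the step, provided at least two adjacent levels occur in the data. The step value and the range of integer codes below are hypothetical.
</p>

```python
import numpy as np

rng = np.random.default_rng(2)
true_step = 0.01                                 # hypothetical digitization step
adc_codes = rng.integers(-100, 101, size=5000)   # integer converter codes
signal = adc_codes * true_step                   # quantized signal

# np.unique returns the distinct values already sorted
estimated_step = np.min(np.diff(np.unique(signal)))
```

<p>
Since <code>np.unique</code> already returns sorted values, no additional sort is needed in this sketch.
</p>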
</div>
</div>
</div>

<div id="outline-container-orgheadline8" class="outline-2">
<h2 id="orgheadline8"><span class="section-number-2">4</span> Plot the data</h2>
<div class="outline-text-2" id="text-4">
<p>
Plotting the data for interactive exploration is trivial. The only trick is to add (or subtract) a proper offset (that we get here using the maximal value of each channel from our five-number summary); this is done automatically by our <code>plot_data_list</code> function:
</p>


<div class="org-src-container">

<pre class="src src-python">tt = np.arange(0,data_len)/1.5e4
swp.plot_data_list(data,tt,0.1)
</pre>
</div>
<p>
The first channel is drawn as is, the second is offset downward by the sum of its maximal value and of the absolute value of the minimal value of the first, etc. We then get something like Fig. <a href="#orgparagraph1">1</a>.
</p>
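<p>
The offset rule just described can be sketched without any plotting. The helper <code>stacked_offsets</code> below is hypothetical (it is not the code of <code>plot_data_list</code>); it only computes the vertical shift applied to each channel so that the top of a trace touches the bottom of the previous one.
</p>

```python
import numpy as np

def stacked_offsets(channels):
    """Vertical offsets stacking each channel below the previous one."""
    offsets = [0.0]                      # the first channel is drawn as is
    for previous, current in zip(channels[:-1], channels[1:]):
        # shift by max of the current channel plus |min| of the previous one
        gap = np.max(current) + abs(np.min(previous))
        offsets.append(offsets[-1] - gap)
    return offsets

toy_channels = [np.array([-1.0, 2.0]),
                np.array([-3.0, 4.0]),
                np.array([-2.0, 1.0])]
offsets = stacked_offsets(toy_channels)
shifted = [c + o for c, o in zip(toy_channels, offsets)]
```

<p>
With the toy channels, the maximum of each shifted channel equals the minimum of the shifted channel above it, so the traces do not overlap.
</p>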


<div id="orgparagraph1" class="figure">
<p><img src="figsL1/WholeRawData.png" alt="WholeRawData.png" />
</p>
<p><span class="figure-number">Figure 1:</span> The whole (20 s) Locust antennal lobe data set.</p>
</div>

<p>
It is also good to "zoom in" and look at the data with a finer time scale (Fig. <a href="#orgparagraph2">2</a>) with:
</p>

<div class="org-src-container">

<pre class="src src-python">plt.xlim([0,0.2])
</pre>
</div>


<div id="orgparagraph2" class="figure">
<p><img src="figsL1/First200ms.png" alt="First200ms.png" />
</p>
<p><span class="figure-number">Figure 2:</span> First 200 ms of the Locust data set.</p>
</div>
</div>
</div>

<div id="outline-container-orgheadline10" class="outline-2">
<h2 id="orgheadline10"><span class="section-number-2">5</span> Data renormalization</h2>
<div class="outline-text-2" id="text-5">
<p>
We are going to use a <a href="http://en.wikipedia.org/wiki/Median_absolute_deviation">median absolute deviation</a> (<code>MAD</code>) based renormalization. The goal of the procedure is to scale the raw data such that the <i>noise SD</i> is approximately 1. Since it is not straightforward to obtain a noise SD on data where both signal (<i>i.e.</i>, spikes) and noise are present, we use this <a href="http://en.wikipedia.org/wiki/Robust_statistics">robust</a> type of statistic for the SD:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock9">data_mad = list(map(swp.mad,data))
data_mad
</pre>
</div>

<pre class="example">
[0.51729684828925626,
 0.62706123501700972,
 0.74028320607479514,
 0.68418138527772443]
</pre>


<p>
And we normalize accordingly (we also subtract the <code>median</code> which is not exactly 0):
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock10">data = list(map(lambda x: (x-np.median(x))/swp.mad(x), data))
</pre>
</div>
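<p>
For readers who want to see what a <code>MAD</code> based scale estimate does, here is a minimal sketch on synthetic data. The <code>mad</code> function below is an assumption: it uses the usual definition (median of the absolute deviations from the median, multiplied by 1.4826 so that it estimates the SD of Gaussian data) and may differ in detail from <code>swp.mad</code>. The synthetic "spikes" are large constant deflections added to unit-variance noise.
</p>

```python
import numpy as np

def mad(x):
    """Median absolute deviation scaled to estimate a Gaussian SD."""
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(4)
noise = rng.standard_normal(100000)      # noise with SD 1
trace = noise.copy()
trace[::200] += 15.0                     # sparse, large synthetic "spikes"

sd_estimate = np.std(trace)              # inflated by the spikes
mad_estimate = mad(trace)                # stays close to the noise SD
normalized = (trace - np.median(trace)) / mad_estimate
```

<p>
In this sketch the <code>SD</code> is clearly larger than 1 while the <code>MAD</code> stays close to the true noise level, which is the property the renormalization relies on.
</p>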
<p>
We can check on a plot (Fig. <a href="#orgparagraph3">3</a>) how <code>MAD</code> and <code>SD</code> compare:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock11">plt.plot(tt,data[0],color="black")
plt.xlim([0,0.2])
plt.axhline(y=1,color="red")
plt.axhline(y=-1,color="red")
plt.axhline(y=np.std(data[0]),color="blue",linestyle="dashed")
plt.axhline(y=-np.std(data[0]),color="blue",linestyle="dashed")
plt.xlabel('Time (s)')
plt.ylim([-5,10])
</pre>
</div>


<div id="orgparagraph3" class="figure">
<p><img src="figsL1/site1-with-MAD-and-SD.png" alt="site1-with-MAD-and-SD.png" />
</p>
<p><span class="figure-number">Figure 3:</span> First 200 ms on site 1 of the Locust data set. In red: +/- the <code>MAD</code>; in dashed blue +/- the <code>SD</code>.</p>
</div>
</div>

<div id="outline-container-orgheadline9" class="outline-3">
<h3 id="orgheadline9"><span class="section-number-3">5.1</span> A quick check that the <code>MAD</code> "does its job"</h3>
<div class="outline-text-3" id="text-5-1">
<p>
We can check that the <code>MAD</code> does its job as a robust estimate of the <i>noise</i> standard deviation by looking at <a href="http://en.wikipedia.org/wiki/Q-Q_plot">Q-Q plots</a> of the whole traces normalized with the <code>MAD</code> and normalized with the "classical" <code>SD</code> (Fig. <a href="#orgparagraph4">4</a>):
</p>

<div class="org-src-container">

<pre class="src src-python"># Probabilities at which the quantiles are evaluated
probs = np.arange(0.01,0.99,0.001)
# Quantiles of the MAD normalized traces
dataQ = [mquantiles(x, prob=probs) for x in data]
# Quantiles of the same traces normalized by their SD
dataQsd = [mquantiles(x/np.std(x), prob=probs) for x in data]
from scipy.stats import norm
qq = norm.ppf(probs)
plt.plot(np.linspace(-3,3,num=100),np.linspace(-3,3,num=100),
         color='grey')
colors = ['black', 'orange', 'blue', 'red']
for i,y in enumerate(dataQ):
    plt.plot(qq,y,color=colors[i])

for i,y in enumerate(dataQsd):
    plt.plot(qq,y,color=colors[i],linestyle="dashed")

plt.xlabel('Normal quantiles')
plt.ylabel('Empirical quantiles')
</pre>
</div>


<div id="orgparagraph4" class="figure">
<p><img src="figsL1/check-MAD.png" alt="check-MAD.png" />
</p>
<p><span class="figure-number">Figure 4:</span> Performance of <code>MAD</code> based vs <code>SD</code> based normalizations. After normalizing the data of each recording site by its <code>MAD</code> (plain colored curves) or its <code>SD</code> (dashed colored curves), Q-Q plots against a standard normal distribution were constructed. Colors: site 1, black; site 2, orange; site 3, blue; site 4, red.</p>
</div>

<p>
We see that the behavior of the "away from normal" fraction is much more homogeneous, for small as well as for large quantile values, with the <code>MAD</code> normalized traces than with the <code>SD</code> normalized ones. Because the spikes inflate the <code>SD</code>, an automatic rule like the "three sigmas" one lets fewer samples cross the threshold (<i>i.e.</i>, gives fewer putative spikes) with the <code>SD</code> based normalization than with the <code>MAD</code> based one.
</p>
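<p>
The effect of the normalization on a "three sigmas" rule can be illustrated with a small simulation. This is a sketch on synthetic data, not on the Locust recordings: unit-variance noise receives frequent, moderate deflections, and we count the samples above 3 after each normalization. The spike amplitude and rate are arbitrary choices.
</p>

```python
import numpy as np

rng = np.random.default_rng(5)
trace = rng.standard_normal(100000)    # noise with SD 1
trace[::50] += 4.0                     # frequent, moderate synthetic "spikes"
centered = trace - np.median(trace)

# Two scale estimates: robust (MAD based) and classical (SD)
mad_scale = 1.4826 * np.median(np.abs(centered))
sd_scale = np.std(centered)

# Number of samples above a threshold of 3 after each normalization
n_above_mad = int(np.sum(centered / mad_scale > 3))
n_above_sd = int(np.sum(centered / sd_scale > 3))
```

<p>
Because the deflections inflate <code>sd_scale</code> but barely change <code>mad_scale</code>, the same threshold of 3 corresponds to a higher amplitude after <code>SD</code> normalization and fewer samples exceed it.
</p>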
</div>
</div>
</div>

<div id="outline-container-orgheadline13" class="outline-2">
<h2 id="orgheadline13"><span class="section-number-2">6</span> Detect peaks</h2>
<div class="outline-text-2" id="text-6">
<p>
We are going to filter the data slightly using a "box" filter of length 5. That is, the data points of the original trace are going to be replaced by the average of themselves with their four nearest neighbors. We will then scale the filtered traces such that the <code>MAD</code> is one on each recording site and keep only the parts of the signal above 4:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock12">from scipy.signal import fftconvolve
from numpy import apply_along_axis as apply
data_filtered = apply(lambda x:
                      fftconvolve(x,np.array([1,1,1,1,1])/5.,'same'),
                      1,np.array(data))
data_filtered = (data_filtered.transpose() / \
                 apply(swp.mad,1,data_filtered)).transpose()
data_filtered[data_filtered &lt; 4] = 0
</pre>
</div>
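<p>
The two steps (box filtering, then rectification at a threshold) can be followed by hand on a toy trace. The sketch below uses <code>np.convolve</code> rather than <code>fftconvolve</code>; both compute the same convolution, the latter simply being faster on long traces. The toy values are arbitrary and the <code>MAD</code> rescaling step is left out for brevity.
</p>

```python
import numpy as np

box = np.ones(5) / 5.0                 # box filter of length 5
toy = np.zeros(21)
toy[3] = 10.0                          # small deflection
toy[10] = 30.0                         # large deflection

# Each point becomes the mean of itself and its four nearest neighbors
smoothed = np.convolve(toy, box, mode='same')

# Rectification: keep only what reaches the threshold, set the rest to 0
threshold = 4.0
rectified = np.where(smoothed >= threshold, smoothed, 0.0)
```

<p>
The large deflection is spread over five points of height 6 and survives the rectification, while the small one is spread over five points of height 2 and is set to zero.
</p>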
<p>
We can see the difference between the <i>raw</i> trace and the <i>filtered and rectified</i> one, on which spikes are going to be detected (Fig. <a href="#orgparagraph5">5</a>), with:
</p>

<div class="org-src-container">

<pre class="src src-python">plt.plot(tt, data[0],color='black')
plt.axhline(y=4,color="blue",linestyle="dashed")
plt.plot(tt, data_filtered[0,],color='red')
plt.xlim([0,0.2])
plt.ylim([-5,10])
plt.xlabel('Time (s)')
</pre>
</div>


<div id="orgparagraph5" class="figure">
<p><img src="figsL1/compare-raw-and-filtered-data.png" alt="compare-raw-and-filtered-data.png" />
</p>
<p><span class="figure-number">Figure 5:</span> First 200 ms on site 1 of data set <code>data</code>. The raw data are shown in black, the detection threshold appears in dashed blue and the filtered and rectified trace on which spike detection is going to be performed appears in red.</p>
</div>

<p>
We now use function <code>peak</code> on the sum of the rows of our filtered and rectified version of the data:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock13">sp0 = swp.peak(data_filtered.sum(0))
</pre>
</div>

<p>
Giving <code>1795</code> spikes, a mean inter-event interval of <code>167.0</code> sampling points, a standard deviation of <code>144.0</code> sampling points, a smallest inter-event interval of <code>16</code> sampling points and a largest of <code>1157</code> sampling points.
</p>
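<p>
To give an idea of what a peak detector with a minimal distance between events does, and of how the inter-event statistics quoted above are obtained, here is a minimal sketch. The functions <code>find_local_maxima</code> and <code>interval_summary</code> are hypothetical and are not the implementation of <code>swp.peak</code>; in particular the default minimal distance of 15 sampling points is an assumption.
</p>

```python
import numpy as np

def find_local_maxima(x, minimal_dist=15):
    """Indices of positive local maxima more than minimal_dist apart."""
    x = np.asarray(x, dtype=float)
    middle = x[1:-1]
    is_max = np.logical_and.reduce([middle > x[:-2],    # rises from the left
                                    middle >= x[2:],    # does not rise further
                                    middle > 0])        # non zero part only
    candidates = np.flatnonzero(is_max) + 1
    kept = []
    for idx in candidates:
        # keep a candidate only if it is far enough from the last kept one
        if not kept or idx - kept[-1] > minimal_dist:
            kept.append(int(idx))
    return np.array(kept, dtype=int)

def interval_summary(positions):
    """Number of events and basic statistics of the inter-event intervals."""
    intervals = np.diff(positions)
    return {'n_events': len(positions),
            'mean': float(np.mean(intervals)),
            'sd': float(np.std(intervals)),
            'min': int(np.min(intervals)),
            'max': int(np.max(intervals))}

toy = np.zeros(200)
toy[[20, 25, 60, 150]] = [5.0, 3.0, 6.0, 4.0]
peaks = find_local_maxima(toy)
summary = interval_summary(peaks)
```

<p>
In the toy trace the maximum at index 25 is dropped because it is too close to the one at index 20. Note that this sketch keeps the first of two close maxima, not necessarily the largest one.
</p>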
</div>

<div id="outline-container-orgheadline11" class="outline-3">
<h3 id="orgheadline11"><span class="section-number-3">6.1</span> Interactive spike detection check</h3>
<div class="outline-text-3" id="text-6-1">
<p>
We can then check the detection quality with:
</p>

<div class="org-src-container">

<pre class="src src-python">swp.plot_data_list_and_detection(data,tt,sp0)
plt.xlim([0,0.2])
</pre>
</div>


<div id="orgparagraph6" class="figure">
<p><img src="figsL1/check-spike-detection.png" alt="check-spike-detection.png" />
</p>
<p><span class="figure-number">Figure 6:</span> First 200 ms of data set <code>data</code>. The raw data are shown in black, the detected events are signaled by red dots (a dot is put on each recording site at the amplitude on that site at that time).</p>
</div>
</div>
</div>

<div id="outline-container-orgheadline12" class="outline-3">
<h3 id="orgheadline12"><span class="section-number-3">6.2</span> Split the data set in two parts</h3>
<div class="outline-text-3" id="text-6-2">
<p>
As explained in the text, we want to "emulate" a long data set analysis where the model is estimated on the early part before doing template matching on what follows. We therefore get an "early" and a "late" part by splitting the data set in two:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock14">sp0E = sp0[sp0 &lt;= data_len/2.]
sp0L = sp0[sp0 &gt; data_len/2.]
</pre>
</div>

<p>
In <code>sp0E</code>, the number of detected events is: <code>908</code> ; the mean inter-event interval is: <code>165.0</code>; the standard deviation of the inter-event intervals is: <code>139.0</code>; the smallest inter-event interval is: <code>16</code> sampling points long; the largest inter-event interval is: <code>931</code> sampling points long.
</p>

<p>
In <code>sp0L</code>, the number of detected events is: <code>887</code>; the mean inter-event interval is: <code>169.0</code>; the standard deviation of the inter-event intervals is: <code>149.0</code>; the smallest inter-event interval is: <code>16</code> sampling points long; the largest inter-event interval is: <code>1157</code> sampling points long.
</p>
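<p>
The split can be checked on a toy example. The sketch below assumes that the event positions are sorted in increasing order (which is the case for the output of a detector scanning the trace from left to right) and uses <code>np.searchsorted</code> to find the split point; the helper name <code>split_events</code> is hypothetical.
</p>

```python
import numpy as np

def split_events(positions, data_len):
    """Split sorted event positions at the middle of the recording."""
    positions = np.asarray(positions)
    # index of the first position strictly larger than data_len / 2
    split_at = np.searchsorted(positions, data_len / 2.0, side='right')
    return positions[:split_at], positions[split_at:]

toy_positions = np.array([10, 40, 50, 51, 90])
early, late = split_events(toy_positions, data_len=100)
```

<p>
An event located exactly at the midpoint ends up in the "early" part, as with the boolean indexing used above.
</p>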
</div>
</div>
</div>

<div id="outline-container-orgheadline17" class="outline-2">
<h2 id="orgheadline17"><span class="section-number-2">7</span> Cuts</h2>
<div class="outline-text-2" id="text-7">
<p>
After detecting our spikes, we must make our cuts in order to create our events' sample. The obvious question we must first address is: How long should our cuts be? The pragmatic way to get an answer is:
</p>
<ul class="org-ul">
<li>Make cuts much longer than what we think is necessary, like 50 sampling points on both sides of the detected event's time.</li>
<li>Compute robust estimates of the "central" event (with the <code>median</code>) and of the dispersion of the sample around this central event (with the <code>MAD</code>).</li>
<li>Plot the two together and check when the <code>MAD</code> trace reaches the background noise level (at 1 since we have normalized the data).</li>
<li>Having the central event allows us to see if it outlasts significantly the region where the <code>MAD</code> is above the background noise level.</li>
</ul>

<p>
Clearly, cutting beyond the time at which the <code>MAD</code> falls back to the noise level should not bring any useful information as far as classifying the spikes is concerned. So here we perform this task as follows:
</p>

<div class="org-src-container">

<pre class="src src-python">evtsE = swp.mk_events(sp0E,np.array(data),49,50)
evtsE_median=apply(np.median,0,evtsE)
evtsE_mad=apply(swp.mad,0,evtsE)
</pre>
</div>

<div class="org-src-container">

<pre class="src src-python">plt.plot(evtsE_median, color='red', lw=2)
plt.axhline(y=0, color='black')
for i in np.arange(0,400,100):
    plt.axvline(x=i, color='black', lw=2)

for i in np.arange(0,400,10):
    plt.axvline(x=i, color='grey')

plt.plot(evtsE_median, color='red', lw=2)
plt.plot(evtsE_mad, color='blue', lw=2)
</pre>
</div>


<div id="orgparagraph7" class="figure">
<p><img src="figsL1/check-MAD-on-long-cuts.png" alt="check-MAD-on-long-cuts.png" />
</p>
<p><span class="figure-number">Figure 7:</span> Robust estimates of the central event (red) and of the sample's dispersion around the central event (blue) obtained with "long" (100 sampling points) cuts. We see clearly that the dispersion is back to noise level 15 points before the peak and 30 points after the peak.</p>
</div>

<p>
Fig. 7 clearly shows that starting the cuts 15 points before the peak and ending them 30 points after should fulfill our goals. We also see that the central event slightly outlasts the window where the <code>MAD</code> is larger than 1.
</p>
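<p>
The visual check can also be approximated numerically. The sketch below is only an illustration: the function <code>cut_bounds</code>, its tolerance <code>tol</code> and the toy <code>MAD</code> trace are hypothetical, and on real data the trace fluctuates, so the plot remains the reference.
</p>

```python
import numpy as np

def cut_bounds(mad_trace, peak_idx, noise_level=1.0, tol=0.1):
    """Return (before, after): the number of points before and after
    peak_idx over which the MAD trace exceeds noise_level * (1 + tol)."""
    above = np.flatnonzero(np.asarray(mad_trace) > noise_level * (1.0 + tol))
    if above.size == 0:
        return 0, 0
    before = max(int(peak_idx - above[0]), 0)
    after = max(int(above[-1] - peak_idx), 0)
    return before, after

# Hypothetical MAD trace: noise level 1 with a plateau around the peak (index 49).
mad_trace = np.ones(100)
mad_trace[35:80] = 1.5
before, after = cut_bounds(mad_trace, peak_idx=49)
```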
</div>

<div id="outline-container-orgheadline14" class="outline-3">
<h3 id="orgheadline14"><span class="section-number-3">7.1</span> Events</h3>
<div class="outline-text-3" id="text-7-1">
<p>
Once we are satisfied with our spike detection, at least provisionally, and have decided on the length of our cuts, we proceed by making <code>cuts</code> around the detected events:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock15">evtsE = swp.mk_events(sp0E,np.array(data),14,30)
</pre>
</div>
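<p>
To see how cuts from several recording sites end up in a single vector, here is a small sketch. The function <code>cut_event</code> is hypothetical (it is not <code>swp.mk_events</code>) and it assumes the data are stored as a channels by samples array. With 14 points before, 30 points after and 4 sites, each event has 45 x 4 = 180 coordinates.
</p>

```python
import numpy as np

def cut_event(data, position, before, after):
    """Cut [position - before, position + after] on every channel and
    concatenate the channels one after the other into a single vector."""
    data = np.asarray(data)
    start, stop = position - before, position + after + 1
    if start < 0 or stop > data.shape[1]:
        raise ValueError("cut extends beyond the recording")
    # reshape(-1) lays out channel 0 first, then channel 1, and so on.
    return data[:, start:stop].reshape(-1)

# Toy recording: 4 sites, 1000 sampling points each (values are just indices).
data = np.arange(4 * 1000, dtype=float).reshape(4, 1000)
positions = [100, 400, 750]            # hypothetical event reference times
events = np.array([cut_event(data, p, 14, 30) for p in positions])
```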

<p>
We can visualize the first 200 events with:
</p>

<div class="org-src-container">

<pre class="src src-python">swp.plot_events(evtsE,200)
</pre>
</div>


<div id="orgparagraph8" class="figure">
<p><img src="figsL1/first-200-of-evtsE.png" alt="first-200-of-evtsE.png" />
</p>
<p><span class="figure-number">Figure 8:</span> First 200 events of <code>evtsE</code>. Cuts from the four recording sites appear one after the other. The background (white / grey) changes with the site. In red, <i>robust</i> estimate of the "central" event obtained by computing the pointwise median. In blue, <i>robust</i> estimate of the scale (SD) obtained by computing the pointwise <code>MAD</code>.</p>
</div>
</div>
</div>

<div id="outline-container-orgheadline15" class="outline-3">
<h3 id="orgheadline15"><span class="section-number-3">7.2</span> Noise</h3>
<div class="outline-text-3" id="text-7-2">
<p>
Getting an estimate of the statistical properties of the noise is an essential ingredient for building respectable goodness-of-fit tests. In our approach, "noise events" are essentially anything that is not an "event" in the sense of the previous section. I wrote essentially and not exactly because there is a little twist here: the minimal distance we are willing to accept between the reference time of a noise event and the reference times of the last preceding and of the first following "event". We could think that keeping a cut length on each side would be enough. That would indeed be the case if <i>all</i> events were starting from and returning to zero within a cut. But this is not the case with the cut parameters we chose previously (that will become clear soon). You might wonder why we chose so short a cut length then. Simply to avoid having to deal with too many superposed events, which are the really troublesome events for anyone wanting to do proper sorting.
To obtain our noise events we are going to use the function <code>mk_noise</code>, which takes the <i>same</i> arguments as the function <code>mk_events</code> plus two numbers:
</p>
<ul class="org-ul">
<li><code>safety_factor</code>: a number by which the cut length is multiplied and which sets the minimal distance between the reference times discussed in the previous paragraph.</li>
<li><code>size</code>: the maximal number of noise events one wants to cut (the actual number obtained might be smaller depending on the data length, the cut length, the safety factor and the number of events).</li>
</ul>

<p>
We cut noise events with a rather large safety factor:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock16">noiseE = swp.mk_noise(sp0E,np.array(data),14,30,safety_factor=2.5,size=2000)
</pre>
</div>
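<p>
The role of the safety factor can be illustrated with a short sketch that picks reference times for noise cuts in the gaps between consecutive events. The function <code>noise_positions</code> and the toy event times are hypothetical. This is one possible way to enforce the minimal distance, not a description of how <code>swp.mk_noise</code> works internally.
</p>

```python
import math

def noise_positions(event_positions, before, after, safety_factor=2.0, size=2000):
    """Return at most `size` reference times located at least
    safety_factor * cut_length away from every event."""
    cut_len = before + after + 1
    min_dist = math.ceil(safety_factor * cut_len)
    events = sorted(event_positions)
    positions = []
    # Scan each gap between two consecutive events.
    for prev, nxt in zip(events[:-1], events[1:]):
        t = prev + min_dist
        while t + min_dist <= nxt and len(positions) < size:
            positions.append(t)
            t += cut_len          # successive noise cuts do not overlap
    return positions

events = [100, 400, 450, 1000]    # hypothetical event reference times
noise = noise_positions(events, before=14, after=30, safety_factor=2.5)
```

<p>
The gap between the events at 400 and 450 is too short to host a noise cut, which shows why the number of noise events obtained can be smaller than <code>size</code>.
</p>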
</div>
</div>

<div id="outline-container-orgheadline16" class="outline-3">
<h3 id="orgheadline16"><span class="section-number-3">7.3</span> Getting "clean" events</h3>
<div class="outline-text-3" id="text-7-3">
<p>
Our spike sorting has two main stages: the first one consists in estimating a <b>model</b> and the second one consists in using this model to <b>classify</b> the data. Our <b>model</b> is going to be built out of reasonably "clean" events. Here by clean we mean events which are not due to a nearly simultaneous firing of two or more neurons; and simultaneity is defined on the time scale of one of our cuts. When the model is subsequently used to classify data, events are going to be decomposed into their (putative) constituents when they are not "clean", that is, <b>superpositions are going to be looked for and accounted for</b>.
</p>

<p>
In order to eliminate the most obvious superpositions we are going to use a rather brute-force approach: we look at the sides of the central peak of our median event and check that individual events are not too large there, that is, that they do not exhibit extra peaks. We first define a function doing this job:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock17">def good_evts_fct(samp, thr=3):
    samp_med = apply(np.median,0,samp)
    samp_mad = apply(swp.mad,0,samp)
    above = samp_med &gt; 0
    samp_r = samp.copy()
    for i in range(samp.shape[0]): samp_r[i,above] = 0
    samp_med[above] = 0
    res = apply(lambda x:
                np.all(abs((x-samp_med)/samp_mad) &lt; thr),
                1,samp_r)
    return res
</pre>
</div>
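<p>
The function above relies on <code>apply</code> and <code>swp.mad</code>, which are defined elsewhere. As an illustration, here is a self-contained variant of the same idea applied to synthetic data. The names <code>good_events</code> and <code>mad</code>, the waveform and the injected superposition are all assumptions made for this sketch.
</p>

```python
import numpy as np

def mad(x):
    """Scaled median absolute deviation (1.4826 factor is an assumption)."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def good_events(samp, thr=3.0):
    """Flag events that stay within thr MADs of the median event wherever
    the median event is not positive (i.e. outside the central peak)."""
    samp = np.asarray(samp, dtype=float)
    samp_med = np.median(samp, axis=0)
    samp_mad = np.apply_along_axis(mad, 0, samp)
    above = samp_med > 0
    samp_r = samp.copy()
    samp_r[:, above] = 0.0        # ignore the central peak
    samp_med[above] = 0.0
    z = np.abs((samp_r - samp_med) / samp_mad)
    return np.all(z < thr, axis=1)

# Hypothetical waveform: a positive peak flanked by slightly negative sides.
rng = np.random.default_rng(3)
template = -1.0 * np.ones(45)
template[12:18] = 10.0
events = template + rng.normal(0.0, 1.0, size=(200, 45))
events[0, 35] -= 15.0             # inject an extra deflection: a "superposition"

good = good_events(events, thr=8)
```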

<p>
We then apply our new function to our sample using a threshold of 8 (set by trial and error):
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock18">goodEvts = good_evts_fct(evtsE,8)
</pre>
</div>

<p>
Out of <code>898</code> events we get <code>843</code> "good" ones. As usual, the first 200 good ones can be visualized with:
</p>

<div class="org-src-container">

<pre class="src src-python">swp.plot_events(evtsE[goodEvts,:][:200,:])
</pre>
</div>


<div id="orgparagraph9" class="figure">
<p><img src="figsL1/first-200-clean-of-evtsE.png" alt="first-200-clean-of-evtsE.png" />
</p>
<p><span class="figure-number">Figure 9:</span> First 200 "good" events of <code>evtsE</code>. Cuts from the four recording sites appear one after the other. The background (white / grey) changes with the site. In red, <i>robust</i> estimate of the "central" event obtained by computing the pointwise median. In blue, <i>robust</i> estimate of the scale (SD) obtained by computing the pointwise <code>MAD</code>.</p>
</div>

<p>
The "bad" guys can be visualized with:
</p>

<div class="org-src-container">

<pre class="src src-python">swp.plot_events(evtsE[np.logical_not(goodEvts),:],
                show_median=False,
                show_mad=False)
</pre>
</div>


<div id="orgparagraph10" class="figure">
<p><img src="figsL1/bad-of-evtsE.png" alt="bad-of-evtsE.png" />
</p>
<p><span class="figure-number">Figure 10:</span> The  <code>50</code> "bad" events of <code>evtsE</code>. Cuts from the four recording sites appear one after the other. The background (white / grey) changes with the site.</p>
</div>
</div>
</div>
</div>

<div id="outline-container-orgheadline22" class="outline-2">
<h2 id="orgheadline22"><span class="section-number-2">8</span> Dimension reduction</h2>
<div class="outline-text-2" id="text-8">
</div>

<div id="outline-container-orgheadline18" class="outline-3">
<h3 id="orgheadline18"><span class="section-number-3">8.1</span> Principal Component Analysis (PCA)</h3>
<div class="outline-text-3" id="text-8-1">
<p>
Our events currently live in a 180-dimensional space (our cuts are 45 sampling points long and we are working with 4 recording sites simultaneously). It turns out that it is hard for most humans to perceive structures in such spaces. It is also hard, not to say impossible with a realistic sample size, to estimate probability densities (which is what model-based clustering algorithms are actually doing) in such spaces, unless one is ready to make strong assumptions about these densities. It is therefore usually good practice to try to reduce the dimension of the <a href="http://en.wikipedia.org/wiki/Sample_space">sample space</a> used to represent the data. We are going to do that with <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">principal component analysis</a> (<code>PCA</code>), using it on our "good" events.
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock19">from numpy.linalg import svd
varcovmat = np.cov(evtsE[goodEvts,:].T)
u, s, v = svd(varcovmat)
</pre>
</div>

<p>
With this "back to the roots" approach, <code>u</code> should be an orthonormal matrix whose columns are the <code>principal components</code> (and <code>v</code> should be the transpose of <code>u</code> since our matrix <code>varcovmat</code> is symmetric and real by construction). <code>s</code> is a vector containing the amount of sample variance explained by each principal component.
</p>
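<p>
These properties are easy to check numerically. The following sketch uses a small synthetic sample (an assumption, not the tutorial's data) and verifies that the columns of <code>u</code> are orthonormal, that the values in <code>s</code> add up to the total sample variance, and that <code>u</code> and <code>s</code> alone rebuild the covariance matrix.
</p>

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.default_rng(1)
# Synthetic sample: 300 observations in 6 dimensions with unequal variances.
sample = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 1.0, 0.5])

varcovmat = np.cov(sample.T)
u, s, v = svd(varcovmat)

# The columns of u form an orthonormal basis.
orthonormal = np.allclose(u.T @ u, np.eye(6))
# The singular values sum to the total variance (trace of the covariance matrix).
total_var_ok = np.isclose(s.sum(), np.trace(varcovmat))
# For a symmetric positive definite matrix, u diag(s) u^T rebuilds the matrix.
rebuilt_ok = np.allclose(u @ np.diag(s) @ u.T, varcovmat)
# Fraction of the total variance explained by each principal component.
explained = s / s.sum()
```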
</div>
</div>

<div id="outline-container-orgheadline19" class="outline-3">
<h3 id="orgheadline19"><span class="section-number-3">8.2</span> Exploring <code>PCA</code> results</h3>
<div class="outline-text-3" id="text-8-2">
<p>
<code>PCA</code> is a rather abstract procedure to most of its users, at least when they start using it. But one way to grasp what it does is to plot the <code>mean event</code> plus or minus, say, five times each principal component, like this:
</p>

<div class="org-src-container">

<pre class="src src-python">evt_idx = range(180)
evtsE_good_mean = np.mean(evtsE[goodEvts,:],0)
for i in range(4):
    plt.subplot(2,2,i+1)
    plt.plot(evt_idx,evtsE_good_mean, 'black',evt_idx,
             evtsE_good_mean + 5 * u[:,i],
             'red',evt_idx,evtsE_good_mean - 5 * u[:,i], 'blue')
    plt.title('PC' + str(i) + ': ' + str(round(s[i]/sum(s)*100)) +'%')
</pre>
</div>


<div id="orgparagraph11" class="figure">
<p><img src="figsL1/explore-evtsE-PC0to3.png" alt="explore-evtsE-PC0to3.png" />
</p>
<p><span class="figure-number">Figure 11:</span> PCA of <code>evtsE</code> (for "good" events) exploration (PC 0 to 3). Each of the 4 graphs shows the mean waveform (black), the mean waveform + 5 x PC (red), the mean - 5 x PC (blue) for each of the first 4 PCs. The fraction of the total variance "explained" by the component appears in the title of each graph.</p>
</div>

<p>
We can see in Fig. 11 that the first 3 PCs (PC 0, 1 and 2) correspond to pure amplitude variations. An event with a large projection (<code>score</code>) on the first PC (PC 0) is smaller than the average event on recording sites 1, 2 and 3, but not on 4. An event with a large projection on PC 1 is larger than average on site 1, smaller than average on sites 2 and 3, and identical to the average on site 4. An event with a large projection on PC 2 is larger than the average on site 4 only. PC 3 is the first principal component corresponding to a change in <i>shape</i> as opposed to <i>amplitude</i>. A large projection on PC 3 means that the event has a shallower first valley and a deeper second valley than the average event on all recording sites.
</p>

<p>
We now look at the next 4 principal components:
</p>

<div class="org-src-container">

<pre class="src src-python">for i in range(4,8):
    plt.subplot(2,2,i-3)
    plt.plot(evt_idx,evtsE_good_mean, 'black',
             evt_idx,evtsE_good_mean + 5 * u[:,i], 'red',
             evt_idx,evtsE_good_mean - 5 * u[:,i], 'blue')
    plt.title('PC' + str(i) + ': ' + str(round(s[i]/sum(s)*100)) +'%')
</pre>
</div>


<div id="orgparagraph12" class="figure">
<p><img src="figsL1/explore-evtsE-PC4to7.png" alt="explore-evtsE-PC4to7.png" />
</p>
<p><span class="figure-number">Figure 12:</span> PCA of <code>evtsE</code> (for "good" events) exploration (PC 4 to 7). Each of the 4 graphs shows the mean waveform (black), the mean waveform + 5 x PC (red), the mean - 5 x PC (blue). The fraction of the total variance "explained" by the component appears in the title of each graph.</p>
</div>

<p>
An event with a large projection on PC 4 (Fig. 12) tends to be "slower" than the average event. An event with a large projection on PC 5 exhibits slower kinetics of its second valley than the average event. PC 4 and 5 correspond to effects shared among recording sites. PC 6 also corresponds to a "change of shape" effect on all sites except the first. Events with a large projection on PC 7 rise slightly faster and decay slightly slower than the average event on all recording sites. Notice also that PC 7 has a "noisier" aspect than the others, suggesting that we are reaching the limit of the "events extra variability" compared to the variability present in the background noise.
</p>

<p>
This guess can be confirmed by comparing the variance of the "good" events sample with the one of the noise sample to which the variance contributed by the first K PCs is added:
</p>

<div class="org-src-container">

<pre class="src src-python">noiseVar = sum(np.diag(np.cov(noiseE.T)))
evtsVar = sum(s)
[(i,sum(s[:i])+noiseVar-evtsVar) for i in range(15)]
</pre>
</div>

<pre class="example">
[(0, -577.55150481947305),
 (1, -277.46515432919722),
 (2, -187.56341162342278),
 (3, -128.03907765900999),
 (4, -91.318669099617864),
 (5, -58.839887602314093),
 (6, -36.36306744692456),
 (7, -21.543722414005629),
 (8, -8.2644951775207574),
 (9, 0.28488929424531761),
 (10, 6.9067335500932359),
 (11, 13.341548838374251),
 (12, 19.472089099226878),
 (13, 25.255335647533229),
 (14, 29.102104713041399)]
</pre>

<p>
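The difference becomes positive at K = 9, that is, the noise variance plus the variance carried by the first 9 PCs already matches the variance of the events sample. This suggests that keeping the first 10 PCs should be more than enough.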
</p>
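<p>
The selection rule used here can be written as a small function. This is a sketch with made-up numbers: the function name and the toy values of <code>s</code> and <code>noise_var</code> are hypothetical and are only meant to show the arithmetic.
</p>

```python
from itertools import accumulate

def first_k_reaching_events_variance(s, noise_var):
    """Smallest K such that noise_var + sum(s[:K]) >= sum(s), or None
    if no K satisfies the inequality."""
    evts_var = sum(s)
    cumulated = [0.0] + list(accumulate(s))   # variance carried by the first K PCs
    for k, c in enumerate(cumulated):
        if c + noise_var - evts_var >= 0:
            return k
    return None

# Toy variances per principal component (total 100) and toy noise variance.
s = [50.0, 30.0, 10.0, 4.0, 3.0, 2.0, 1.0]
noise_var = 8.0
k = first_k_reaching_events_variance(s, noise_var)
```

<p>
With these toy numbers the differences are -92, -42, -12, -2 and then 2, so the function returns 4.
</p>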
</div>
</div>

<div id="outline-container-orgheadline20" class="outline-3">
<h3 id="orgheadline20"><span class="section-number-3">8.3</span> Static representation of the projected data</h3>
<div class="outline-text-3" id="text-8-3">
<p>
We can build a <code>scatter plot matrix</code> showing the projections of our "good" events sample onto the planes defined by pairs of the first few PCs:
</p>

<div class="org-src-container">

<pre class="src src-python" id="orgsrcblock20">evtsE_good_P0_to_P3 = np.dot(evtsE[goodEvts,:],u[:,0:4])
from pandas.plotting import scatter_matrix
import pandas as pd
df = pd.DataFrame(evtsE_good_P0_to_P3)
scatter_matrix(df,alpha=0.2,s=4,c='k',figsize=(6,6),
               diagonal='kde',marker=".")
</pre>
</div>
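<p>
The projection step itself can be checked on synthetic data (an assumption, not the tutorial's events). The sketch below projects a sample onto its first 4 PCs and verifies that the scores are uncorrelated and that their variances equal the corresponding values of <code>s</code>. Projecting without subtracting the mean, as done above, only shifts the cloud of points and leaves these variances unchanged.
</p>

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic sample: 400 observations in 8 dimensions with decreasing spread.
sample = rng.normal(size=(400, 8)) * np.linspace(4.0, 0.5, 8)

u, s, _ = np.linalg.svd(np.cov(sample.T))

# Scores: coordinates of each observation on the first 4 principal components.
scores = sample @ u[:, :4]
score_var = scores.var(axis=0, ddof=1)
score_cov = np.cov(scores.T)
```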


<div id="orgparagraph13" class="figure">
<p><img src="figsL1/Fig4.png" alt="Fig4.png" />
</p>
<p><span class="figure-number">Figure 13:</span> Scatter plot matrix of the projections of the good events in <code>evtsE</code> onto the planes defined by the first 4 PCs. The diagonal shows a smooth (Gaussian kernel based) density estimate of the projection of the sample on the corresponding PC. Using the first 8 PCs does not make finer structure visible.</p>
</div>
</div>
</div>