Commit b2163dfb authored by Loic Huder's avatar Loic Huder
Browse files

Merge remote-tracking branch 'origin/master'

parents 63175989 457d68e0
......@@ -14,17 +14,17 @@
"source": [
"Algorithm on series (*e.g.* time series, character strings, ...)\n",
"\n",
"A series can be either a one dimensional ordered series of items (e.g. an numerical array, a string).\n",
"A series is a one dimensional ordered series of items (e.g. an numerical array, a string).\n",
"\n",
"We want to compute a dissimilarity measure between series. The measure can either apply to series of \n",
"same length or not, and can be a metric (i.e. symmetric, $d(x, y) = 0 \\iff x = y$, triangular inequality). \n",
"\n",
"We consider two dissimilarity measures with different features. \n",
"We consider two (dis)similarity measures with different features. \n",
"\n",
"S1 and S2 two series of length |S1| and |S2|\n",
"\n",
"- **dtw**: dynamic time wrapping, complexity O(|S1|*|S2|)\n",
"- **cort**: normalized cosine distances between derivatives, complexity O(|S1| + |S2|)\n",
"- **dtw**: similarity measure: dynamic time wrapping, complexity O(|S1|*|S2|)\n",
"- **cort**: normalized cosine similarity measure between derivatives, complexity O(|S1| + |S2|)\n",
"\n"
]
},
......@@ -206,7 +206,7 @@
"\n",
"**Input**: two series S1 and S2 *of same length*\n",
"\n",
"**Output:** a dissimilarity measure\n",
"**Output:** a similarity measure\n",
"\n",
"**Complexity:** O(|S1|+|S2|)\n",
"\n",
......@@ -215,21 +215,9 @@
"What do we compute: \n",
"-------------------------------\n",
"\n",
"The cosine distance between derivatives of the series. \n",
"\n",
"normalize \n",
"\n",
" num = 0.0\n",
" sum_square_x = 0.0\n",
" sum_square_y = 0.0\n",
" for t in range(len(s1) - 1):\n",
" slope_1 = s1[t + 1] - s1[t]\n",
" slope_2 = s2[t + 1] - s2[t]\n",
" num += slope_1 * slope_2\n",
" sum_square_x += slope_1 * slope_1\n",
" sum_square_y += slope_2 * slope_2\n",
" return num / (np.sqrt(sum_square_x * sum_square_y))\n",
"The cosine similarity measure between derivatives of the series. \n",
"\n",
" \n",
"\n"
]
},
......@@ -243,23 +231,11 @@
"$$ \\begin{eqnarray}\n",
"cort(A, B) &=& \\cos(dA, dB) \\\\\n",
"&=& \\frac{dA \\cdot dB}{\\Vert dA\\Vert \\Vert dB\\Vert} \\\\\n",
"&=& \\frac{\\sum_{i=0}^T dA_i dB_i}{\\Vert dA\\Vert \\Vert dB\\Vert} \\\\\n",
"&=& \\frac{\\sum_{i=0}^T (A_{i+1}-A_i) (B_{i+1}-B_i)}{\\sqrt{\\sum_{i=0}^T (A_{i+1}-A_i)^2} \\sqrt{\\sum_{i=0}^T (B_{i+1}-B_i)^2}}\n",
"&=& \\frac{\\sum_{i=0}^{T} dA_i dB_i}{\\Vert dA\\Vert \\Vert dB\\Vert} \\\\\n",
"&=& \\frac{\\sum_{i=0}^{T-1} (A_{i+1}-A_i) (B_{i+1}-B_i)}{\\sqrt{\\sum_{i=0}^{T-1} (A_{i+1}-A_i)^2} \\sqrt{\\sum_{i=0}^{T-1} (B_{i+1}-B_i)^2}}\n",
"\\end{eqnarray} $$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do we compute:\n",
"------------------------------\n",
"\n",
"<div align=\"middle\">\n",
"<img src=\"./fig/cort_formule.png\" style=\"width: 75%\">\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
......@@ -270,7 +246,7 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
......@@ -300,25 +276,22 @@
},
{
"cell_type": "code",
"execution_count": 29,
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4629100498862757"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
"name": "stdout",
"output_type": "stream",
"text": [
"cort(x,2*x)=1.0 cort([1,2], [2,1])=-1.0\n"
]
}
],
"source": [
"x = [1, 2, 3, 5, 5, 6]\n",
"y = [1, 1, 2, 2, 3, 5]\n",
"\n",
"cort(x,y)"
"print(f\"cort(x,2*x)={cort(x, 2*x)} cort([1,2], [2,1])={cort([1,2], [2,1])}\")"
]
}
],
......@@ -338,7 +311,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.7.3"
}
},
"nbformat": 4,
......
%% Cell type:markdown id: tags:
A running example
================
%% Cell type:markdown id: tags:
Algorithm on series (*e.g.* time series, character strings, ...)
A series can be either a one dimensional ordered series of items (e.g. an numerical array, a string).
A series is a one dimensional ordered series of items (e.g. an numerical array, a string).
We want to compute a dissimilarity measure between series. The measure can either apply to series of
same length or not, and can be a metric (i.e. symmetric, $d(x, y) = 0 \iff x = y$, triangular inequality).
We consider two dissimilarity measures with different features.
We consider two (dis)similarity measures with different features.
S1 and S2 two series of length |S1| and |S2|
- **dtw**: dynamic time wrapping, complexity O(|S1|*|S2|)
- **cort**: normalized cosine distances between derivatives, complexity O(|S1| + |S2|)
- **dtw**: similarity measure: dynamic time wrapping, complexity O(|S1|*|S2|)
- **cort**: normalized cosine similarity measure between derivatives, complexity O(|S1| + |S2|)
%% Cell type:markdown id: tags:
Dynamic Time Wrapping
------------------------------------
- **Input:** two series, S1 and S2, not necessarily of same length
- **Output:** a dissimilarity measure
- **Complexity:** O([S1|*|S2|)
- **Metric:** no: does not respect the triangular inequality
- **Side product:** an alignment between the series can be stored
What do we compute:
-------------------------------
The transformation (with minimal cost) to transform one serie in the other one.
Example of what is computed
-------------------------------------------
%% Cell type:markdown id: tags:
Example of result of dtw :
-----------------------------------
<div align="middle">
<img src="./fig/dtw_example.png" style="width: 100%">
</div>
%% Cell type:markdown id: tags:
Idea of the algorithm
------------------------------
Inspired from
https://riptutorial.com/dynamic-programming/example/25780/introduction-to-dynamic-time-warping
Let $a$ and $b$ be two series. We have:
- dtw is a dynamic programming algorithm: the solution is built incrementally
- a table $t$ is incrementally filled.
- the value of the cell $t[i, j]$ holds the *distance* between the sub series $a[:i]$ and $b[:j]$
- the value of the cell $t[i, j]$ is computed using the values of cells $t[i-1, j]$, $t[i, j-1]$ and $t[i-1, j-1]$:
$$t[i, j] = d(i, j) + min(t[i-1, j], t[i-1, j-1], t[i, j-1])$$
where $d(i, j)$ is the distance between $s[i]$ and $s[j]$ (we will use the absolute difference)
An example with two series [0, 1, 1, 2, 2, 3, 5] and [0, 1, 2, 3, 5, 5, 5, 6]
%% Cell type:markdown id: tags:
(image from https://riptutorial.com/dynamic-programming/example/25780/introduction-to-dynamic-time-warping
<div align="middle">
<img src="./fig/dtw_ex_table.jpg" style="width: 100%">
</div>
%% Cell type:markdown id: tags:
why 6 in the t[-1, -1] ?
%% Cell type:markdown id: tags:
Easy enough to implement:
----------------------------------------
%% Cell type:code id: tags:
``` python
import numpy as np
def DTWDistance_pure_python(s1, s2):
""" Computes the dtw between s1 and s2 with distance the absolute distance
:param s1: the first series (ie an iterable over floats64)
:param s2: the second series (ie an iterable over floats64)
:returns: the dtw distance
:rtype: float64
"""
_dtw_mat = np.empty([len(s1), len(s2)])
_dtw_mat[0, 0] = abs(s1[0] - s2[0])
# two special cases : filling first row and columns
for j in range(1, len(s2)):
dist = abs(s1[0]-s2[j])
_dtw_mat[0, j] = dist + _dtw_mat[0, j-1]
for i in range(1, len(s1)):
dist = abs(s1[i]-s2[0])
_dtw_mat[i, 0] = dist + _dtw_mat[(i-1, 0)]
# filling the matrix
for i in range(1, len(s1)):
for j in range(1, len(s2)):
dist = abs(s1[i]-s2[j])
_dtw_mat[(i, j)] = dist + min(_dtw_mat[i-1, j],
_dtw_mat[i, j-1],
_dtw_mat[i-1, j-1])
return _dtw_mat[len(s1)-1, len(s2)-1], _dtw_mat
```
%% Cell type:code id: tags:
``` python
x = [1, 2, 3, 5, 5, 5, 6]
y = [1, 1, 2, 2, 3, 5]
nx = len(x)
ny = len(y)
d, mat = DTWDistance_pure_python(x, y)
print(d)
mat
```
%%%% Output: stream
1.0
%%%% Output: execute_result
array([[ 0., 0., 1., 2., 4., 8.],
[ 1., 1., 0., 0., 1., 4.],
[ 3., 3., 1., 1., 0., 2.],
[ 7., 7., 4., 4., 2., 0.],
[11., 11., 7., 7., 4., 0.],
[15., 15., 10., 10., 6., 0.],
[20., 20., 14., 14., 9., 1.]])
%% Cell type:markdown id: tags:
Cort
-------
**Input**: two series S1 and S2 *of same length*
**Output:** a dissimilarity measure
**Output:** a similarity measure
**Complexity:** O(|S1|+|S2|)
**Metric:** yes
What do we compute:
-------------------------------
The cosine distance between derivatives of the series.
The cosine similarity measure between derivatives of the series.
normalize
num = 0.0
sum_square_x = 0.0
sum_square_y = 0.0
for t in range(len(s1) - 1):
slope_1 = s1[t + 1] - s1[t]
slope_2 = s2[t + 1] - s2[t]
num += slope_1 * slope_2
sum_square_x += slope_1 * slope_1
sum_square_y += slope_2 * slope_2
return num / (np.sqrt(sum_square_x * sum_square_y))
%% Cell type:markdown id: tags:
What do we compute:
------------------------------
$$ \begin{eqnarray}
cort(A, B) &=& \cos(dA, dB) \\
&=& \frac{dA \cdot dB}{\Vert dA\Vert \Vert dB\Vert} \\
&=& \frac{\sum_{i=0}^T dA_i dB_i}{\Vert dA\Vert \Vert dB\Vert} \\
&=& \frac{\sum_{i=0}^T (A_{i+1}-A_i) (B_{i+1}-B_i)}{\sqrt{\sum_{i=0}^T (A_{i+1}-A_i)^2} \sqrt{\sum_{i=0}^T (B_{i+1}-B_i)^2}}
&=& \frac{\sum_{i=0}^{T} dA_i dB_i}{\Vert dA\Vert \Vert dB\Vert} \\
&=& \frac{\sum_{i=0}^{T-1} (A_{i+1}-A_i) (B_{i+1}-B_i)}{\sqrt{\sum_{i=0}^{T-1} (A_{i+1}-A_i)^2} \sqrt{\sum_{i=0}^{T-1} (B_{i+1}-B_i)^2}}
\end{eqnarray} $$
%% Cell type:markdown id: tags:
What do we compute:
------------------------------
<div align="middle">
<img src="./fig/cort_formule.png" style="width: 75%">
</div>
%% Cell type:markdown id: tags:
Easy enough to implement:
---------------------------------------
%% Cell type:code id: tags:
``` python
from math import sqrt
def cort(s1, s2):
""" Computes the cort between series one and two (assuming they have the same length)
:param s1: the first series (or any iterable over floats64)
:param s2: the second series (or any iterable over floats64)
:returns: the cort distance
:rtype: float
:precondition: series are assumed to be of same size
"""
num = 0.0
sum_square_x = 0.0
sum_square_y = 0.0
for t in range(len(s1)-1):
slope_1 = s1[t+1] - s1[t]
slope_2 = s2[t+1] - s2[t]
num = num + slope_1 * slope_2
sum_square_x = sum_square_x + (slope_1*slope_1)
sum_square_y = sum_square_y + (slope_2 * slope_2)
return num/(sqrt(sum_square_x*sum_square_y))
```
%% Cell type:code id: tags:
``` python
x = [1, 2, 3, 5, 5, 6]
y = [1, 1, 2, 2, 3, 5]
cort(x,y)
print(f"cort(x,2*x)={cort(x, 2*x)} cort([1,2], [2,1])={cort([1,2], [2,1])}")
```
%%%% Output: execute_result
%%%% Output: stream
0.4629100498862757
cort(x,2*x)=1.0 cort([1,2], [2,1])=-1.0
......
all:
python3 setup.py develop
clean:
rm -f cdtw.*.so
rm -rf build dtw_cort_dist_mat.egg-info
#include <stdlib.h>
#include <stdio.h>
#include <float.h>
#include <math.h>
#include <float.h> /* for max_float */
#include "cdtw.h"
double min3(const double x, const double y, const double z){
if (x < y)
if (x < z) return x;
if (y < z) return y;
return z;
}
/*
Returns the minimum-distance warp path
between multivariate time series x and y.
*/
double dtw(double* x, double* y, int size_x, int size_y){
int i, j;
double* temp;
double *costP = malloc(size_x*sizeof(double));
if (! costP) return -1;
double *costC = malloc(size_x*sizeof(double));
if (! costC){
free(costP);
return -1; /* TODO: better handling of the error */
}
costP[0] = fabs(x[0] - y[0]);
for (i=1; i < size_x; i++){
costP[i] = fabs(x[i] - y[0]) + costP[i-1];
}
for (j=1; j< size_y; j++){
costC[0] = costP[0] + fabs(x[0] - y[j]);
for (i=1; i < size_x; i++){
costC[i] = min3(costC[i-1], costP[i-1], costP[i]) + fabs(x[i] - y[j]);
}
temp = costP;
costP = costC;
costC = temp;
}
/* result is read in costP because switching between costP and costC is done */
double ret = costP[size_x-1];
free(costP);
free(costC);
return ret;
}
double cort(double *x, double *y, int size){
int t;
double slopex, slopey;
double num;
double sumSquareXslope, sumSquareYslope;
num = sumSquareXslope = sumSquareYslope = 0.0;
for (t = 0; t < size-1; t++){
slopex = x[t+1] - x[t];
slopey = y[t+1] - y[t];
num += slopex * slopey;
sumSquareXslope += slopex * slopex;
sumSquareYslope += slopey * slopey;
}
return num/sqrt(sumSquareXslope*sumSquareYslope);
}
double dtw(double* x, double* y, int size_x, int size_y);
double cort(double *x, double *y, int size);
from setuptools import setup, Extension
version = "0.1"
module_distance = Extension(
name="cdtw",
sources=["cdtw.c"],
extra_compile_args=["-O3", "-mavx2"],
extra_link_args=["-O3"],
)
setup(
name="dtw_cort_dist_mat",
version=version,
description="data scientist tool for time series",
long_description="data scientist tool for time series",
classifiers=[], # Get strings from http://pypi.python.org/pypi?%3Aaction=list_classifiers
author="IKATS project",
author_email="Franck.Thollard@imag.fr",
url="http://ikats.imag.fr",
license="GPL",
include_package_data=True,
zip_safe=False,
install_requires=["cffi", "numpy", "setuptools"],
entry_points="",
ext_modules=[module_distance],
)
from pathlib import Path
from invoke import task
import numpy as np
here = Path(__file__).parent.absolute()
......@@ -11,7 +12,10 @@ def clean(c):
for path in directories:
if (path / "Makefile").exists():
c.run(f"cd {path} && make clean", echo=True)
for algo in ("dtw", "cort"):
f = path / "ref_{algo}.npy"
if f.exists():
f.unlink()
@task
def build(c):
......@@ -19,6 +23,18 @@ def build(c):
if (path / "Makefile").exists():
c.run(f"cd {path} && make", echo=True)
@task
def check(c):
ref_dtw = np.load("ref_small_dtw.npy")
ref_cort = np.load("ref_small_cort.npy")
for path in directories:
c.run(f"cd {path} && ./dtw_cort_dist_mat.py -s res", echo=True)
calc_dtw = np.load(f"{path / 'res_dtw.npy'}")
print(f"{path.name} dtw : allclose {np.allclose(ref_dtw, calc_dtw)}")
calc_cort = np.load(f"{path / 'res_cort.npy'}")
print(f"{path.name} cort: allclose {np.allclose(ref_cort, calc_cort)}")
@task
def bench(c, medium=False):
......
......@@ -33,6 +33,9 @@ def main(compute, only_init=False):
parser.add_argument(
"-n", type=int, default=None, help="number of series (default all 200)"
)
parser.add_argument(
"-s", type=str, default=None, help="save the matrices (provide prefixes)"
)
parser.add_argument(
"-v", action="store_true", default=None, help="visualize the matrix"
)
......@@ -55,12 +58,18 @@ def main(compute, only_init=False):
_dist_mat_dtw, _dist_mat_cort = compute(series, nb_series)
print("elapsed time = {:.3f} s".format(time() - t0))
if args.s:
dtw_mat = "{}_dtw".format(args.s)
np.save(dtw_mat, _dist_mat_dtw)
cort_mat = "{}_cort".format(args.s)
np.save(cort_mat, _dist_mat_cort)
if args.v:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2)
cax0 = axes[0].imshow(_dist_mat_dtw, cmap=plt.cm.gray)
axes[0].imshow(_dist_mat_dtw, cmap=plt.cm.gray)
axes[0].set_title("dtw")
cax1 = axes[1].imshow(_dist_mat_cort, cmap=plt.cm.gray)
axes[1].imshow(_dist_mat_cort, cmap=plt.cm.gray)
axes[1].set_title("cort")
plt.show()
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment