elifkrtl/pkNN

k-nearest neighbor algorithm with p-adic distance functions

The k-nearest neighbor (k-NN) algorithm is a widely used method for classification and prediction. Its performance is strongly influenced by the chosen distance (or similarity) measure. By Ostrowski’s theorem, any non-trivial absolute value on $\mathbb{Q}$ is, up to equivalence, either the usual absolute value or a $p$-adic absolute value for some prime $p$. In this project, $p$-adic absolute values are coupled with k-NN to define alternative distance functions on $\mathbb{Q}^n$.

If you want to try p-adic calculations in Python, please import the p_adic.py script first.

Related works

This work is related to papers published in Neurocomputing and the Journal of Classification.

Kartal, E., Çalışkan, F., Eskişehirli, B. B., & Özen, Z. (2024). p-adic distance and k-Nearest Neighbor classification. Neurocomputing, 578, 127400.

Kartal, E., Çalışkan, F., Eskişehirli, B. B., & Özen, Z. (in press). p-adic Distance Functions and Comparative Performance Analysis of Distance Functions with the k-NN Classifier. Journal of Classification.

Funding

This work was supported by the Scientific and Technological Research Council of Türkiye (TÜBİTAK) (Grant No. 123E293).


The $p$-adic metric on $\mathbb{Q}$

Let $p$ be a prime. For a nonzero integer $\alpha$, the $p$-adic valuation $\mathrm{ord}_p(\alpha)$ is the unique nonnegative integer such that

$$ \alpha = p^{\mathrm{ord}_p(\alpha)}\,\alpha', $$

where $\alpha'\in\mathbb{Z}$ and $p\nmid \alpha'$. For a nonzero rational number $x=\frac{a}{b}\in\mathbb{Q}$,

$$ \mathrm{ord}_p(x)=\mathrm{ord}_p(a)-\mathrm{ord}_p(b) $$

and

$$ \mathrm{ord}_p(0)=+\infty. $$

Valuation examples

$$ \mathrm{ord}_2(64)=6,\qquad \mathrm{ord}_3(32)=0,\qquad \mathrm{ord}_5(10)=1,\qquad \mathrm{ord}_7(9)=0. $$
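The valuation can be sketched in a few lines of Python; `ord_p` here is a hypothetical helper name, and the repository's p_adic.py may implement it differently:

```python
def ord_p(n: int, p: int) -> int:
    """p-adic valuation of a nonzero integer n: the exponent of p in n."""
    if n == 0:
        raise ValueError("ord_p(0) = +infinity by convention")
    n, k = abs(n), 0
    while n % p == 0:  # strip factors of p one at a time
        n //= p
        k += 1
    return k

print(ord_p(64, 2), ord_p(32, 3), ord_p(10, 5), ord_p(9, 7))  # 6 0 1 0
```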

The $p$-adic absolute value

The $p$-adic absolute value $|\cdot|_p$ on $\mathbb{Q}$ is defined by

$$ |x|_p= \begin{cases} p^{-\mathrm{ord}_p(x)}, & x\neq 0,\\ 0, & x=0. \end{cases} $$

Absolute value examples

For $p=5$,

$$ |125|_5=5^{-3}=\frac{1}{125}=0.008. $$

For $p=3$,

$$ |125|_3=3^{0}=1. $$

For $p=2$,

$$ \left|\frac{36}{54}\right|_2 =\left|\frac{2}{3}\right|_2 =2^{-\left(\mathrm{ord}_2(2)-\mathrm{ord}_2(3)\right)} =2^{-(1-0)} =\frac{1}{2}=0.5. $$
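These examples can be checked with exact rational arithmetic; `ord_p` and `abs_p` are hypothetical helper names (the repository's p_adic.py may differ), and `Fraction` keeps every value exact:

```python
from fractions import Fraction

def ord_p(n: int, p: int) -> int:
    """p-adic valuation of a nonzero integer."""
    n, k = abs(n), 0
    while n % p == 0:
        n //= p
        k += 1
    return k

def abs_p(x, p: int) -> Fraction:
    """p-adic absolute value |x|_p of a rational x, returned exactly."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    v = ord_p(x.numerator, p) - ord_p(x.denominator, p)  # ord_p(a/b) = ord_p(a) - ord_p(b)
    return Fraction(1, p**v) if v >= 0 else Fraction(p**(-v))

print(abs_p(125, 5))               # 1/125
print(abs_p(125, 3))               # 1
print(abs_p(Fraction(36, 54), 2))  # 1/2
```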


The $p$-adic distance functions on $\mathbb{Q}^n$

The $p$-adic metric on $\mathbb{Q}$ is

$$ d_p(x,y)=|x-y|_p, $$

where $x,y\in\mathbb{Q}$. Using coordinate-wise differences in $\mathbb{Q}^n$, three $p$-adic distance functions are defined:

  • $p$-adic Manhattan distance
  • $p$-adic Euclidean distance
  • $p$-adic Chebyshev distance

The terminology is adopted because distances defined via the $p$-adic absolute value provide natural $p$-adic counterparts to the classical distances induced by the usual absolute value.

Let $x=(x_1,\dots,x_n),\quad y=(y_1,\dots,y_n)\in\mathbb{Q}^n$.


$p$-adic Manhattan distance

The $p$-adic Manhattan distance is defined by

$$ d_{\mathrm{pManh}}(x,y)=\sum_{i=1}^{n}|x_i-y_i|_p. $$

A Manhattan-based $p$-adic distance function for the k-NN algorithm was introduced in Kartal et al. (2024), although it was not assigned a specific name there. For clarity and consistency, it is referred to here as the $p$-adic Manhattan distance ($d_{\mathrm{pManh}}$) (Kartal et al., in press).

Example 1 ($p=5$)

$$ d_5(3,125)=|3-125|_5=|-122|_5=|122|_5. $$

Since $5\nmid 122$ and $\mathrm{ord}_5(122)=0$, $|122|_5=5^{0}=1$. Therefore,

$$ d_5(3,125)=1. $$

Example 2 ($p=3$)

Let

$$ x=(0.2,14,-7,5.6),\qquad y=(0.083,-13,20,-1.4). $$

Then

$$ x-y=(0.117,27,-27,7)=\left(\frac{117}{1000},27,-27,7\right). $$

Now,

$$ \left|\frac{117}{1000}\right|_3=\frac{1}{9} $$

because $117=3^2\cdot 13$ and $3\nmid 1000$. Also, $|27|_3=|-27|_3=\frac{1}{27}$ and $|7|_3=1$. Hence

$$ d_{\mathrm{3Manh}}(x,y) = \frac{1}{9}+\frac{1}{27}+\frac{1}{27}+1 = \frac{32}{27} = 1.185185. $$

Example 3 ($p=5$)

Let

$$ x=\left(\frac{1}{25},10,-6\right),\qquad y=\left(0,5,9\right). $$

Then

$$ x-y=\left(\frac{1}{25},5,-15\right). $$

For $p=5$, $\left|\frac{1}{25}\right|_5=25$, $|5|_5=\frac{1}{5}$, and $|-15|_5=|15|_5=\frac{1}{5}$. Therefore,

$$ d_{\mathrm{5Manh}}(x,y) =25+\frac{1}{5}+\frac{1}{5} =25.4. $$
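Examples 2 and 3 can be reproduced in Python; `abs_p` and `d_p_manh` are hypothetical helper names (the repository's p_adic.py may differ), and decimal inputs are passed as strings so `Fraction` parses them exactly:

```python
from fractions import Fraction

def abs_p(x, p):
    """p-adic absolute value of a rational x (exact)."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    v, n, d = 0, abs(x.numerator), x.denominator
    while n % p == 0:  # factors of p in the numerator raise the valuation
        n //= p
        v += 1
    while d % p == 0:  # factors of p in the denominator lower it
        d //= p
        v -= 1
    return Fraction(1, p**v) if v >= 0 else Fraction(p**(-v))

def d_p_manh(x, y, p):
    """p-adic Manhattan distance: sum of coordinate-wise |x_i - y_i|_p."""
    return sum(abs_p(Fraction(xi) - Fraction(yi), p) for xi, yi in zip(x, y))

x = [Fraction('0.2'), 14, -7, Fraction('5.6')]
y = [Fraction('0.083'), -13, 20, Fraction('-1.4')]
print(d_p_manh(x, y, 3))                                    # 32/27
print(d_p_manh([Fraction(1, 25), 10, -6], [0, 5, 9], 5))    # 127/5 (= 25.4)
```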


$p$-adic Euclidean distance

The $p$-adic Euclidean distance is defined by

$$ d_{\mathrm{pEuc}}(x,y)=\sqrt{\sum_{i=1}^{n}|x_i-y_i|_p^{2}}. $$

Example 1 ($p=3$)

Using the same vectors as above,

$$ (|x_1-y_1|_3,|x_2-y_2|_3, |x_3-y_3|_3,|x_4-y_4|_3) =\left(\frac{1}{9},\frac{1}{27},\frac{1}{27},1\right). $$

Thus

$$ d_{\mathrm{3Euc}}(x,y) =\sqrt{\left(\frac{1}{9}\right)^2+\left(\frac{1}{27}\right)^2+\left(\frac{1}{27}\right)^2+1^2} =\sqrt{\frac{740}{729}} =1.007516. $$

Example 2 ($p=5$)

With

$$ x=\left(\frac{1}{25},10,-6\right),\qquad y=\left(0,5,9\right), $$

we have $(|x_1-y_1|_5,|x_2-y_2|_5, |x_3-y_3|_5)=(25,\frac{1}{5},\frac{1}{5})$. Hence

$$ d_{\mathrm{5Euc}}(x,y) =\sqrt{25^2+\left(\frac{1}{5}\right)^2+\left(\frac{1}{5}\right)^2} =\sqrt{625+\frac{2}{25}} =25.001600. $$
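Both Euclidean examples can be checked the same way; `abs_p` and `d_p_euc` are hypothetical helper names (the repository's p_adic.py may differ), with the final square root taken in floating point:

```python
import math
from fractions import Fraction

def abs_p(x, p):
    """p-adic absolute value of a rational x (exact)."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    v, n, d = 0, abs(x.numerator), x.denominator
    while n % p == 0:
        n //= p
        v += 1
    while d % p == 0:
        d //= p
        v -= 1
    return Fraction(1, p**v) if v >= 0 else Fraction(p**(-v))

def d_p_euc(x, y, p):
    """p-adic Euclidean distance: sqrt of the sum of squared |x_i - y_i|_p."""
    s = sum(abs_p(Fraction(xi) - Fraction(yi), p) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(s)  # exact rational sum, then a float square root

x = [Fraction('0.2'), 14, -7, Fraction('5.6')]
y = [Fraction('0.083'), -13, 20, Fraction('-1.4')]
print(round(d_p_euc(x, y, 3), 6))                                  # 1.007516
print(round(d_p_euc([Fraction(1, 25), 10, -6], [0, 5, 9], 5), 6))  # 25.0016
```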


$p$-adic Chebyshev distance

The $p$-adic Chebyshev distance is defined by

$$ d_{\mathrm{pCheb}}(x,y)=\max_{ i}|x_i-y_i|_p. $$

Note that the $p$-adic Chebyshev distance can be viewed as the metric induced by the standard $p$-adic max-norm on $\mathbb{Q}^n$.

Example 1 ($p=3$)

From the $p=3$ example,

$$ \begin{aligned} d_{\mathrm{3Cheb}}(x,y) &=\max \left(\frac{1}{9},\frac{1}{27},\frac{1}{27},1\right)=1. \end{aligned} $$

Example 2 ($p=5$)

From the $p=5$ example,

$$ d_{\mathrm{5Cheb}}(x,y) =\max\left(25,\frac{1}{5},\frac{1}{5}\right) =25. $$
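The Chebyshev examples follow the same pattern, replacing the sum by a maximum; `abs_p` and `d_p_cheb` are hypothetical helper names (the repository's p_adic.py may differ):

```python
from fractions import Fraction

def abs_p(x, p):
    """p-adic absolute value of a rational x (exact)."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    v, n, d = 0, abs(x.numerator), x.denominator
    while n % p == 0:
        n //= p
        v += 1
    while d % p == 0:
        d //= p
        v -= 1
    return Fraction(1, p**v) if v >= 0 else Fraction(p**(-v))

def d_p_cheb(x, y, p):
    """p-adic Chebyshev distance: max of coordinate-wise |x_i - y_i|_p."""
    return max(abs_p(Fraction(xi) - Fraction(yi), p) for xi, yi in zip(x, y))

x = [Fraction('0.2'), 14, -7, Fraction('5.6')]
y = [Fraction('0.083'), -13, 20, Fraction('-1.4')]
print(d_p_cheb(x, y, 3))                                  # 1
print(d_p_cheb([Fraction(1, 25), 10, -6], [0, 5, 9], 5))  # 25
```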


Decimal rounding (dec) for numerical and mixed datasets

When $p$-adic distances are computed on numerical or mixed datasets, input values are often rounded before $p$-adic computations. The rounding precision is denoted by dec. Since $|x|_p=p^{-\mathrm{ord}_p(x)}$, a small rounding change may alter $\mathrm{ord}_p(\cdot)$ and therefore change $|\cdot|_p$ and the resulting distances.

Demonstration of the effect of dec (using the $p=3$ example)

Let

$$ x=(0.2,14,-7,5.6),\qquad y=(0.083,-13,20,-1.4), $$

so $x-y=(0.117,27,-27,7)$.

Case A (no rounding): For $p=3$, $(|0.117|_3,|27|_3,|-27|_3,|7|_3)= \left(\frac{1}{9},\frac{1}{27},\frac{1}{27},1\right)$. Then

$$ d_{\mathrm{3Manh}}(x,y)=1.185185,\qquad d_{\mathrm{3Euc}}(x,y)=1.007516,\qquad d_{\mathrm{3Cheb}}(x,y)=1. $$

Case B (dec = 1): $0.117\mapsto 0.1=\frac{1}{10}$, hence $\left|0.1\right|_3=1$. The rounded difference vector becomes $\left(\frac{1}{10},27,-27,7\right)$. Then

$$ d_{\mathrm{3Manh}}(x,y) =1+\frac{1}{27}+\frac{1}{27}+1 =2.074074, $$

$$ d_{\mathrm{3Euc}}(x,y) =\sqrt{1^2+\left(\frac{1}{27}\right)^2+\left(\frac{1}{27}\right)^2+1^2} =1.415183, $$

$$ d_{\mathrm{3Cheb}}(x,y) =\max\left(1,\frac{1}{27},\frac{1}{27},1\right) =1. $$
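The sensitivity to dec can be seen directly on the first coordinate of the difference vector; `abs_p` and `round_dec` are hypothetical helper names (the repository's p_adic.py may handle rounding differently):

```python
from fractions import Fraction

def abs_p(x, p):
    """p-adic absolute value of a rational x (exact)."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    v, n, d = 0, abs(x.numerator), x.denominator
    while n % p == 0:
        n //= p
        v += 1
    while d % p == 0:
        d //= p
        v -= 1
    return Fraction(1, p**v) if v >= 0 else Fraction(p**(-v))

def round_dec(x, dec):
    """Round a rational to `dec` decimal places, keeping the result exact."""
    q = 10 ** dec
    return Fraction(round(Fraction(x) * q), q)

diff = Fraction('0.117')
print(abs_p(diff, 3))                # 1/9  (Case A: no rounding)
print(abs_p(round_dec(diff, 1), 3))  # 1    (Case B: 0.117 -> 0.1, ord_3 drops from 2 to 0)
```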

The pseudocode for the k-nearest neighbor algorithm with $p$-adic distance functions

Let the labeled dataset be

$$ X=\left((x_i,y_i)\right)_{i=1}^{n}, $$

where $x_i\in\mathbb{Q}^m$ and $y_i$ is the class label. For a query sample $z$, the predicted label is determined by the majority vote among the $k$ nearest neighbors:

$$ y_z=\underset{c}{\arg\max}\sum_{(x_i,y_i)\in P} 1(y_i=c), $$

where $P\subseteq X$ contains the $k$ nearest neighbors of $z$ under the selected $p$-adic distance (e.g., $d_{\mathrm{pManh}}$, $d_{\mathrm{pEuc}}$, or $d_{\mathrm{pCheb}}$).

In Python, these distances can be integrated into sklearn.neighbors.KNeighborsClassifier via a user-defined metric function. Datasets can be loaded with sklearn.datasets, and the $p$-adic computations are available after importing p_adic.py.
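A minimal sketch of that integration, assuming the Iris dataset as an example and using hypothetical helper names (`abs_p`, `p_adic_manhattan`; the repository's p_adic.py may differ). A callable metric with `algorithm="brute"` keeps scikit-learn from requiring a built-in metric:

```python
from fractions import Fraction
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

def abs_p(x, p):
    """p-adic absolute value of a rational x (exact)."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    v, n, d = 0, abs(x.numerator), x.denominator
    while n % p == 0:
        n //= p
        v += 1
    while d % p == 0:
        d //= p
        v -= 1
    return Fraction(1, p**v) if v >= 0 else Fraction(p**(-v))

def p_adic_manhattan(a, b, p=3, dec=1):
    """p-adic Manhattan distance on float vectors, rounded to `dec` decimals first."""
    q = 10 ** dec
    total = Fraction(0)
    for ai, bi in zip(a, b):
        # round each coordinate to `dec` decimals, then form an exact rational difference
        diff = Fraction(int(round(float(ai) * q)), q) - Fraction(int(round(float(bi) * q)), q)
        total += abs_p(diff, p)
    return float(total)

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3, algorithm="brute",
                           metric=lambda a, b: p_adic_manhattan(a, b, p=3, dec=1))
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy under the 3-adic Manhattan metric
```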
