initial commit

2023-12-25 17:21:39 +08:00
commit 0bd1089d74
94 changed files with 18648 additions and 0 deletions
--- a/data/posts/2023-11-22-rabin-karp.md
+++ b/data/posts/2023-11-22-rabin-karp.md
@@ -0,0 +1,123 @@
+---
+title: "Rabin-Karp Algorithm"
+time: "2023-11-22"
+tags: ["algorithm"]
+summary: "It is designed to address the multiple pattern string matching problem."
+---
+
+The Rabin-Karp algorithm, also known as the Karp-Rabin algorithm, was introduced by _Richard M. Karp_ and _Michael O. Rabin_ in 1987. It is designed to address the multiple pattern string matching problem.
+
+Its implementation is somewhat unconventional. It begins by computing the hash values of two strings and then determines whether there is a match by comparing these hash values.
+
+## Algorithm Analysis and Implementation
+
+Choosing an appropriate hash function is crucial. Assuming the text string is $t[0, n)$, and the pattern string is $p[0, m)$, where $0<m<n$, let $Hash(t[i,j])$ represent the hash value of the substring $t[i, j]$.
+
+When $Hash(t[0, m-1])!=Hash(p[0,m-1])$, it is natural to compare $Hash(t[1, m])$. In this process, if we recalculate the hash value for the substring $t[1, m]$, it would require $O(n)$ time complexity, which is not cost-effective. Observing that there are m-1 overlapping characters between the substrings $t[0, m-1]$ and $t[1, m]$, we can use a rolling hash function. This reduces the time complexity of recalculation to $O(1)$.
+
+The rolling hash function used in the Rabin-Karp algorithm primarily leverages the concept of [Rabin fingerprint](https://en.wikipedia.org/wiki/Rabin_fingerprint). For example, the formula to calculate the hash value of the substring $t[0, m-1]$ is as follows:
+
+$$
+Hash(t[0, m-1])=t[0]*b^{m-1}+t[1]*b^{m-2}+...+t[m-1]*b^0
+\\
+\tag{t[0] represents the ASCII code of the character}
+$$
+
+Here, $b$ is a constant. In Rabin-Karp, it is generally set to 256 because the maximum value of a character does not exceed 255. The formula above has an issue - hash values could overflow. To address this, we take the modulus, and the value should be as large as possible and preferably a prime number. Here, we take 101.
+
+The formula to calculate the hash value of the substring $t[1, m]$ is then:
+
+$$
+Hash(t[1,m])=(Hash(t[0,m-1])-t[0]*b^{m-1})*b+t[m]*b^0\\\tag{Please compare with the previous formula}
+$$
+
+The complete code is as follows:
+
+```cpp
+#include <iostream>
+#include <string.h>
+
+using namespace std;
+
+#define BASE 256
+#define MODULUS 101
+
+void RabinKarp(char t[], char p[])
+{
+    int t_len = strlen(t);
+    int p_len = strlen(p);
+
+    // For rolling hash
+    int h = 1;
+    for (int i = 0; i < p_len - 1; i++)
+        h = (h * BASE) % MODULUS;
+
+    int t_hash = 0;
+    int p_hash = 0;
+    for (int i = 0; i < p_len; i++)
+    {
+        t_hash = (BASE * t_hash + t[i]) % MODULUS;
+        p_hash = (BASE * p_hash + p[i]) % MODULUS;
+    }
+
+    int i = 0;
+    while (i <= t_len - p_len)
+    {
+        // Considering the possibility of hash collisions, we use memcmp for additional verification
+        if (t_hash == p_hash && memcmp(p, t + i, p_len) == 0)
+            cout << p << " is found at index " << i << endl;
+
+        // Rolling hash
+        t_hash = (BASE * (t_hash - t[i] * h) + t[i + p_len]) % MODULUS;
+
+        // Avoiding negative values
+        if (t_hash < 0)
+            t_hash = t_hash + MODULUS;
+
+        i++;
+    }
+}
+
+int main()
+{
+    char t[100] = "It is a test, but not just a test";
+    char p[10] = "test";
+
+    RabinKarp(t, p);
+
+    return 0;
+}
+```
+
+The output is as follows:
+
+```text
+test is found at index 8
+test is found at index 29
+```
+
+## Complexity Analysis
+
+Let's examine the space complexity first, which is easily determined: $S(n)=O(1)$.
+
+Now, consider the time complexity. Let the length of the text string be n and the pattern string be m. Preprocessing requires $O(m)$, and during matching, in the best case where there are no hash collisions, $T_{best}(n)=O(n-m)$. In the worst case, where there is a collision every time, $T_{worst}(n)=O((n-m)*m)$. In practical scenarios, n is often much larger than m, so the final complexity table is:
+
+| $S_{n}$        | $O(1)$  |
+| -------------- | ------- |
+| $T_{best}(n)$  | $O(n)$  |
+| $T_{worst}(n)$ | $O(mn)$ |
+
+## Application Analysis
+
+The primary application of the Rabin-Karp algorithm is in plagiarism detection for articles, such as the detection system used by [Semantic Scholar](https://www.semanticscholar.org/).
+
+However, from the complexity data above, the Rabin-Karp algorithm does not seem to have a significant advantage. Is it practical for detecting text plagiarism? Feedback from actual usage results indicates that the time complexity for plagiarism detection is only $O(n)$. I believe this is mainly due to the following two points:
+
+1. In real-life articles, the text data does not often exhibit as many hash collisions as we might imagine.
+2. The original content in a submitted article is likely to be much larger than the plagiarized content. In other words, successful matches do not occur as frequently as we might imagine.
+
+## References
+
+- [Rabin–Karp algorithm](https://en.wikipedia.org/wiki/Rabin–Karp_algorithm)
+- [Searching for Patterns | Set 3 (Rabin-Karp Algorithm)](https://www.geeksforgeeks.org/searching-for-patterns-set-3-rabin-karp-algorithm/)
+- [Computer Algorithms: Rabin-Karp String Searching](http://www.stoimen.com/blog/2012/04/02/computer-algorithms-rabin-karp-string-searching/)