Commit 1a6ce00

add rapidfuzz
1 parent 4fd1296 commit 1a6ce00

4 files changed: +367 -3 lines changed

Chapter5/natural_language_processing.ipynb

Lines changed: 151 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -2040,6 +2040,157 @@
20402040
"[Link to textstat](https://github.com/shivam5992/textstat)."
20412041
]
20422042
},
2043+
{
2044+
"cell_type": "markdown",
2045+
"id": "d9490655",
2046+
"metadata": {},
2047+
"source": [
2048+
"### RapidFuzz: Rapid String Matching in Python"
2049+
]
2050+
},
2051+
{
2052+
"cell_type": "code",
2053+
"execution_count": null,
2054+
"id": "ad96e31b",
2055+
"metadata": {
2056+
"ExecuteTime": {
2057+
"end_time": "2022-03-30T12:17:33.922993Z",
2058+
"start_time": "2022-03-30T12:17:29.753346Z"
2059+
},
2060+
"tags": [
2061+
"hide-cell"
2062+
]
2063+
},
2064+
"outputs": [],
2065+
"source": [
2066+
"!pip install rapidfuzz"
2067+
]
2068+
},
2069+
{
2070+
"cell_type": "markdown",
2071+
"id": "c503d2a2",
2072+
"metadata": {},
2073+
"source": [
2074+
"If you want to find strings that are similar to another string above a certain threshold, use RapidFuzz. RapidFuzz is a Python library that allows you to quickly match strings."
2075+
]
2076+
},
2077+
{
2078+
"cell_type": "code",
2079+
"execution_count": 2,
2080+
"id": "936a5a98",
2081+
"metadata": {
2082+
"ExecuteTime": {
2083+
"end_time": "2022-03-30T12:18:00.094620Z",
2084+
"start_time": "2022-03-30T12:18:00.089601Z"
2085+
}
2086+
},
2087+
"outputs": [],
2088+
"source": [
2089+
"from rapidfuzz import fuzz"
2090+
]
2091+
},
2092+
{
2093+
"cell_type": "markdown",
2094+
"id": "98cec972",
2095+
"metadata": {},
2096+
"source": [
2097+
"fuzz.ratio calculates the normalized Indel similarity between two strings, as a score from 0 to 100."
2098+
]
2099+
},
2100+
{
2101+
"cell_type": "code",
2102+
"execution_count": 6,
2103+
"id": "03ea76ef",
2104+
"metadata": {
2105+
"ExecuteTime": {
2106+
"end_time": "2022-03-30T12:23:12.478197Z",
2107+
"start_time": "2022-03-30T12:23:12.470715Z"
2108+
}
2109+
},
2110+
"outputs": [
2111+
{
2112+
"data": {
2113+
"text/plain": [
2114+
"98.24561403508771"
2115+
]
2116+
},
2117+
"execution_count": 6,
2118+
"metadata": {},
2119+
"output_type": "execute_result"
2120+
}
2121+
],
2122+
"source": [
2123+
"fuzz.ratio(\"Let's meet at 10 am tomorrow\", \"Let's meet at 10 am tommorrow\")"
2124+
]
2125+
},
2126+
{
2127+
"cell_type": "code",
2128+
"execution_count": 5,
2129+
"id": "9fdcffe5",
2130+
"metadata": {
2131+
"ExecuteTime": {
2132+
"end_time": "2022-03-30T12:20:35.183261Z",
2133+
"start_time": "2022-03-30T12:20:35.173673Z"
2134+
}
2135+
},
2136+
"outputs": [
2137+
{
2138+
"data": {
2139+
"text/plain": [
2140+
"54.54545454545454"
2141+
]
2142+
},
2143+
"execution_count": 5,
2144+
"metadata": {},
2145+
"output_type": "execute_result"
2146+
}
2147+
],
2148+
"source": [
2149+
"fuzz.ratio(\"here you go\", \"you go here\")"
2150+
]
2151+
},
2152+
{
2153+
"cell_type": "markdown",
2154+
"id": "a6cbdb2a",
2155+
"metadata": {},
2156+
"source": [
2157+
"fuzz.token_sort_ratio sorts the words in each string and then calculates the fuzz.ratio between the sorted strings."
2158+
]
2159+
},
2160+
{
2161+
"cell_type": "code",
2162+
"execution_count": 4,
2163+
"id": "3e3443d0",
2164+
"metadata": {
2165+
"ExecuteTime": {
2166+
"end_time": "2022-03-30T12:20:24.748537Z",
2167+
"start_time": "2022-03-30T12:20:24.740736Z"
2168+
}
2169+
},
2170+
"outputs": [
2171+
{
2172+
"data": {
2173+
"text/plain": [
2174+
"100.0"
2175+
]
2176+
},
2177+
"execution_count": 4,
2178+
"metadata": {},
2179+
"output_type": "execute_result"
2180+
}
2181+
],
2182+
"source": [
2183+
"fuzz.token_sort_ratio(\"here you go\", \"you go here\")"
2184+
]
2185+
},
2186+
{
2187+
"cell_type": "markdown",
2188+
"id": "821ea3fb",
2189+
"metadata": {},
2190+
"source": [
2191+
"[Link to RapidFuzz](https://github.com/maxbachmann/RapidFuzz)."
2192+
]
2193+
},
20432194
{
20442195
"cell_type": "markdown",
20452196
"id": "34bbf520",

docs/Chapter5/natural_language_processing.html

Lines changed: 64 additions & 2 deletions
@@ -697,9 +697,14 @@ <h1 class="site-logo" id="site-title">Effective Python for Data Scientists</h1>
697697
6.6.11. textstat: Calculate Statistics From Text
698698
</a>
699699
</li>
700+
<li class="toc-h2 nav-item toc-entry">
701+
<a class="reference internal nav-link" href="#rapidfuzz-rapid-string-matching-in-python">
702+
6.6.12. RapidFuzz: Rapid String Matching in Python
703+
</a>
704+
</li>
700705
<li class="toc-h2 nav-item toc-entry">
701706
<a class="reference internal nav-link" href="#checklist-create-data-to-test-your-nlp-model">
702-
6.6.12. Checklist: Create Data to Test Your NLP Model
707+
6.6.13. Checklist: Create Data to Test Your NLP Model
703708
</a>
704709
</li>
705710
</ul>
@@ -1694,8 +1699,65 @@ <h2><span class="section-number">6.6.11. </span>textstat: Calculate Statistics F
16941699
</div>
16951700
<p><a class="reference external" href="https://github.com/shivam5992/textstat">Link to textstat</a>.</p>
16961701
</div>
1702+
<div class="section" id="rapidfuzz-rapid-string-matching-in-python">
1703+
<h2><span class="section-number">6.6.12. </span>RapidFuzz: Rapid String Matching in Python<a class="headerlink" href="#rapidfuzz-rapid-string-matching-in-python" title="Permalink to this headline"></a></h2>
1704+
<div class="cell tag_hide-cell docutils container">
1705+
<div class="cell_input docutils container">
1706+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip install rapidfuzz
1707+
</pre></div>
1708+
</div>
1709+
</div>
1710+
</div>
1711+
<p>If you want to find strings that are similar to another string above a certain threshold, use RapidFuzz. RapidFuzz is a Python library that allows you to quickly match strings.</p>
1712+
<div class="cell docutils container">
1713+
<div class="cell_input docutils container">
1714+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">rapidfuzz</span> <span class="kn">import</span> <span class="n">fuzz</span>
1715+
</pre></div>
1716+
</div>
1717+
</div>
1718+
</div>
1719+
<p>fuzz.ratio calculates the normalized Indel similarity between two strings, as a score from 0 to 100.</p>
1720+
<div class="cell docutils container">
1721+
<div class="cell_input docutils container">
1722+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">fuzz</span><span class="o">.</span><span class="n">ratio</span><span class="p">(</span><span class="s2">&quot;Let&#39;s meet at 10 am tomorrow&quot;</span><span class="p">,</span> <span class="s2">&quot;Let&#39;s meet at 10 am tommorrow&quot;</span><span class="p">)</span>
1723+
</pre></div>
1724+
</div>
1725+
</div>
1726+
<div class="cell_output docutils container">
1727+
<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>98.24561403508771
1728+
</pre></div>
1729+
</div>
1730+
</div>
1731+
</div>
1732+
<div class="cell docutils container">
1733+
<div class="cell_input docutils container">
1734+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">fuzz</span><span class="o">.</span><span class="n">ratio</span><span class="p">(</span><span class="s2">&quot;here you go&quot;</span><span class="p">,</span> <span class="s2">&quot;you go here&quot;</span><span class="p">)</span>
1735+
</pre></div>
1736+
</div>
1737+
</div>
1738+
<div class="cell_output docutils container">
1739+
<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>54.54545454545454
1740+
</pre></div>
1741+
</div>
1742+
</div>
1743+
</div>
1744+
<p>fuzz.token_sort_ratio sorts the words in each string and then calculates the fuzz.ratio between the sorted strings.</p>
1745+
<div class="cell docutils container">
1746+
<div class="cell_input docutils container">
1747+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">fuzz</span><span class="o">.</span><span class="n">token_sort_ratio</span><span class="p">(</span><span class="s2">&quot;here you go&quot;</span><span class="p">,</span> <span class="s2">&quot;you go here&quot;</span><span class="p">)</span>
1748+
</pre></div>
1749+
</div>
1750+
</div>
1751+
<div class="cell_output docutils container">
1752+
<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>100.0
1753+
</pre></div>
1754+
</div>
1755+
</div>
1756+
</div>
1757+
<p><a class="reference external" href="https://github.com/maxbachmann/RapidFuzz">Link to RapidFuzz</a>.</p>
1758+
</div>
16971759
<div class="section" id="checklist-create-data-to-test-your-nlp-model">
1698-
<h2><span class="section-number">6.6.12. </span>Checklist: Create Data to Test Your NLP Model<a class="headerlink" href="#checklist-create-data-to-test-your-nlp-model" title="Permalink to this headline"></a></h2>
1760+
<h2><span class="section-number">6.6.13. </span>Checklist: Create Data to Test Your NLP Model<a class="headerlink" href="#checklist-create-data-to-test-your-nlp-model" title="Permalink to this headline"></a></h2>
16991761
<div class="cell tag_hide-cell docutils container">
17001762
<div class="cell_input docutils container">
17011763
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip install checklist torch

docs/_sources/Chapter5/natural_language_processing.ipynb

Lines changed: 151 additions & 0 deletions
@@ -2040,6 +2040,157 @@
20402040
"[Link to textstat](https://github.com/shivam5992/textstat)."
20412041
]
20422042
},
2043+
{
2044+
"cell_type": "markdown",
2045+
"id": "d9490655",
2046+
"metadata": {},
2047+
"source": [
2048+
"### RapidFuzz: Rapid String Matching in Python"
2049+
]
2050+
},
2051+
{
2052+
"cell_type": "code",
2053+
"execution_count": null,
2054+
"id": "ad96e31b",
2055+
"metadata": {
2056+
"ExecuteTime": {
2057+
"end_time": "2022-03-30T12:17:33.922993Z",
2058+
"start_time": "2022-03-30T12:17:29.753346Z"
2059+
},
2060+
"tags": [
2061+
"hide-cell"
2062+
]
2063+
},
2064+
"outputs": [],
2065+
"source": [
2066+
"!pip install rapidfuzz"
2067+
]
2068+
},
2069+
{
2070+
"cell_type": "markdown",
2071+
"id": "c503d2a2",
2072+
"metadata": {},
2073+
"source": [
2074+
"If you want to find strings that are similar to another string above a certain threshold, use RapidFuzz. RapidFuzz is a Python library that allows you to quickly match strings."
2075+
]
2076+
},
2077+
{
2078+
"cell_type": "code",
2079+
"execution_count": 2,
2080+
"id": "936a5a98",
2081+
"metadata": {
2082+
"ExecuteTime": {
2083+
"end_time": "2022-03-30T12:18:00.094620Z",
2084+
"start_time": "2022-03-30T12:18:00.089601Z"
2085+
}
2086+
},
2087+
"outputs": [],
2088+
"source": [
2089+
"from rapidfuzz import fuzz"
2090+
]
2091+
},
2092+
{
2093+
"cell_type": "markdown",
2094+
"id": "98cec972",
2095+
"metadata": {},
2096+
"source": [
2097+
"fuzz.ratio calculates the normalized Indel similarity between two strings, as a score from 0 to 100."
2098+
]
2099+
},
2100+
{
2101+
"cell_type": "code",
2102+
"execution_count": 6,
2103+
"id": "03ea76ef",
2104+
"metadata": {
2105+
"ExecuteTime": {
2106+
"end_time": "2022-03-30T12:23:12.478197Z",
2107+
"start_time": "2022-03-30T12:23:12.470715Z"
2108+
}
2109+
},
2110+
"outputs": [
2111+
{
2112+
"data": {
2113+
"text/plain": [
2114+
"98.24561403508771"
2115+
]
2116+
},
2117+
"execution_count": 6,
2118+
"metadata": {},
2119+
"output_type": "execute_result"
2120+
}
2121+
],
2122+
"source": [
2123+
"fuzz.ratio(\"Let's meet at 10 am tomorrow\", \"Let's meet at 10 am tommorrow\")"
2124+
]
2125+
},
2126+
{
2127+
"cell_type": "code",
2128+
"execution_count": 5,
2129+
"id": "9fdcffe5",
2130+
"metadata": {
2131+
"ExecuteTime": {
2132+
"end_time": "2022-03-30T12:20:35.183261Z",
2133+
"start_time": "2022-03-30T12:20:35.173673Z"
2134+
}
2135+
},
2136+
"outputs": [
2137+
{
2138+
"data": {
2139+
"text/plain": [
2140+
"54.54545454545454"
2141+
]
2142+
},
2143+
"execution_count": 5,
2144+
"metadata": {},
2145+
"output_type": "execute_result"
2146+
}
2147+
],
2148+
"source": [
2149+
"fuzz.ratio(\"here you go\", \"you go here\")"
2150+
]
2151+
},
2152+
{
2153+
"cell_type": "markdown",
2154+
"id": "a6cbdb2a",
2155+
"metadata": {},
2156+
"source": [
2157+
"fuzz.token_sort_ratio sorts the words in each string and then calculates the fuzz.ratio between the sorted strings."
2158+
]
2159+
},
2160+
{
2161+
"cell_type": "code",
2162+
"execution_count": 4,
2163+
"id": "3e3443d0",
2164+
"metadata": {
2165+
"ExecuteTime": {
2166+
"end_time": "2022-03-30T12:20:24.748537Z",
2167+
"start_time": "2022-03-30T12:20:24.740736Z"
2168+
}
2169+
},
2170+
"outputs": [
2171+
{
2172+
"data": {
2173+
"text/plain": [
2174+
"100.0"
2175+
]
2176+
},
2177+
"execution_count": 4,
2178+
"metadata": {},
2179+
"output_type": "execute_result"
2180+
}
2181+
],
2182+
"source": [
2183+
"fuzz.token_sort_ratio(\"here you go\", \"you go here\")"
2184+
]
2185+
},
2186+
{
2187+
"cell_type": "markdown",
2188+
"id": "821ea3fb",
2189+
"metadata": {},
2190+
"source": [
2191+
"[Link to RapidFuzz](https://github.com/maxbachmann/RapidFuzz)."
2192+
]
2193+
},
20432194
{
20442195
"cell_type": "markdown",
20452196
"id": "34bbf520",

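Review note: the scores recorded in the notebook outputs can be checked by hand. The sketch below is a plain-Python approximation, not RapidFuzz's implementation. It assumes the Indel distance equals `len(a) + len(b) - 2 * LCS(a, b)` and that the score is `100 * (1 - distance / (len(a) + len(b)))`. The helper names `lcs_length`, `indel_ratio`, and `token_sort_ratio` are hypothetical, and the word sorting here skips any preprocessing the library may apply.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized Indel similarity on a 0-100 scale (assumed formula, see lead-in)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # assumption: two empty strings count as identical
    # Indel distance: number of insertions plus deletions needed to turn a into b.
    distance = total - 2 * lcs_length(a, b)
    return 100.0 * (1 - distance / total)


def token_sort_ratio(a: str, b: str) -> float:
    """Sort the whitespace-separated words, then score the rejoined strings."""
    return indel_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


print(indel_ratio("Let's meet at 10 am tomorrow", "Let's meet at 10 am tommorrow"))
print(indel_ratio("here you go", "you go here"))
print(token_sort_ratio("here you go", "you go here"))
```

For the first pair the strings have 28 and 29 characters and differ by one inserted letter, so the score is 100 * (1 - 1/57), which agrees with the 98.2456... shown in the cell output. For "here you go" and "you go here" the longest common subsequence is "you go" (6 characters), giving 100 * 12/22, or about 54.55.
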
docs/searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.
