Merged
Changes from 1 commit
Commits
21 commits
b2f3cb0
WIP: logger migriate to rich
wakaka6 Apr 10, 2025
9499164
feat(browser): improve browser profile management and cleanup
unclecode Apr 29, 2025
50f0b83
feat(linkedin): add prospect-wizard app with scraping and visualization
unclecode Apr 30, 2025
cd2b490
refactor(logger): Apply the Enumeration for color
wakaka6 May 1, 2025
0e5d672
Merge branch 'pr-971' into merge-pr971
unclecode May 1, 2025
ee01b81
Merge branch 'merge-pr971' into next
unclecode May 1, 2025
7c2fd52
fix: incorrect params and commands in linkedin app readme
aravindkarnam May 1, 2025
94e9959
feat(docker-api): add job-based polling endpoints for crawl and LLM t…
unclecode May 1, 2025
baf7f6a
fix: typo in readme
aravindkarnam May 2, 2025
5cc58f9
fix: 1. duplicate verbose flag 2.inconsistency in argument name --pro…
aravindkarnam May 2, 2025
6650b2f
fix: replace openAI with litellm to support multiple llm providers
aravindkarnam May 2, 2025
bd5a9ac
updated readme with arguments for litellm
aravindkarnam May 2, 2025
87d4b0f
format bash scripts properly so copy & paste may work without issues
aravindkarnam May 2, 2025
9b5ccac
feat(extraction): add RegexExtractionStrategy for pattern-based extra…
unclecode May 2, 2025
38ebcbb
fix: provide support for local llm by adding it to the arguments
aravindkarnam May 5, 2025
a0555d5
merge:from next branch
aravindkarnam May 6, 2025
aaf0591
fix: removed unnecessary imports and installs
aravindkarnam May 6, 2025
206a9df
feat(crawler): add session management and view-source support
unclecode May 8, 2025
76dd86d
Merge remote-tracking branch 'origin/linkedin-prep' into next
unclecode May 8, 2025
a3e9ef9
fix(crawler): remove automatic page closure in screenshot methods
unclecode May 12, 2025
897e017
Set version to 0.6.3
unclecode May 12, 2025
feat(linkedin): add prospect-wizard app with scraping and visualization
Add new LinkedIn prospect discovery tool with three main components:
- c4ai_discover.py for company and people scraping
- c4ai_insights.py for org chart and decision maker analysis
- Interactive graph visualization with company/people exploration

Features include:
- Configurable LinkedIn search and scraping
- Org chart generation with decision maker scoring
- Interactive network graph visualization
- Company similarity analysis
- Chat interface for data exploration

Requires: crawl4ai, openai, sentence-transformers, networkx
unclecode committed Apr 30, 2025
commit 50f0b83fcd4e951b7109b653d14bc3a04ca604a8
126 changes: 126 additions & 0 deletions docs/apps/linkdin/README.md
@@ -0,0 +1,126 @@
# Crawl4AI Prospect‑Wizard – step‑by‑step guide

A three‑stage demo that goes from **LinkedIn scraping** ➜ **LLM reasoning** ➜ **graph visualisation**.

```
prospect‑wizard/
├─ c4ai_discover.py # Stage 1 – scrape companies + people
├─ c4ai_insights.py # Stage 2 – embeddings, org‑charts, scores
├─ graph_view_template.html # Stage 3 – graph viewer (static HTML)
└─ data/ # output lands here (*.jsonl / *.json)
```

---

## 1  Install & boot a LinkedIn profile (one‑time)

### 1.1  Install dependencies
```bash
pip install crawl4ai openai sentence-transformers networkx pandas vis-network rich
```

### 1.2  Create / warm a LinkedIn browser profile
```bash
crwl profiler
```
1. The interactive shell shows **New profile** – hit **enter**.
2. Choose a name, e.g. `profile_linkedin_uc`.
3. A Chromium window opens – log in to LinkedIn, solve any CAPTCHA that appears, then close the window.

> Remember the **profile name**. All future runs take `--profile-name <your_name>`.

---

## 2  Discovery – scrape companies & people

```bash
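# --geo: 102713980 is the Malaysia geoUrn
# --title_filters: "" disables the filter; or pass e.g. "Product,Engineering"
# --max_companies / --max_people: defaults are set small for workshops
# (comments live up here because text after a trailing "\" breaks line continuation)
python c4ai_discover.py full \
--query "health insurance management" \
--geo 102713980 \
--title_filters "" \
--max_companies 10 \
--max_people 20 \
--profile-name profile_linkedin_uc \
--outdir ./data \
--concurrency 2 \
--log_level debug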
```
**Outputs** in `./data/`:
* `companies.jsonl` – one JSON per company
* `people.jsonl` – one JSON per employee

🛠️ **Dry‑run:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses bundled HTML snippets, no network.
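
A minimal sketch for inspecting the Stage‑1 output, assuming only that each line of a `*.jsonl` file is one JSON object, as described above. The helper name `read_jsonl` is illustrative and not part of `c4ai_discover.py`; no field names are assumed.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list with one parsed JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    data_dir = Path("./data")  # same directory as --outdir in the command above
    for name in ("companies.jsonl", "people.jsonl"):
        file_path = data_dir / name
        if file_path.exists():
            print(f"{name}: {len(read_jsonl(file_path))} records")
        else:
            print(f"{name}: not found (run Stage 1 first)")
```

Comparing the record counts against `--max_companies` and `--max_people` is a quick sanity check before moving on to Stage 2.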

### Handy geoUrn cheatsheet
| Location | geoUrn |
|----------|--------|
| Singapore | **103644278** |
| Malaysia | **102713980** |
| United States | **103644922** |
| United Kingdom | **102221843** |
| Australia | **101452733** |
_See more: <https://www.linkedin.com/search/results/companies/?geoUrn=XXX> – the number after `geoUrn=` is what you need._
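
If you copy a search URL from the browser, the ID can be pulled out programmatically. The sketch below uses only the standard library; `extract_geo_urn` is a hypothetical helper, and the handling of a bracketed or quoted value is an assumption about how the browser may encode the parameter.

```python
from urllib.parse import parse_qs, urlparse


def extract_geo_urn(url):
    """Return the first numeric geoUrn ID found in the URL's query string, or None."""
    values = parse_qs(urlparse(url).query).get("geoUrn", [])
    if not values:
        return None
    # Assumption: the value may be wrapped, e.g. ["102713980"]; keep only digit runs.
    digit_runs = "".join(ch if ch.isdigit() else " " for ch in values[0]).split()
    return digit_runs[0] if digit_runs else None


if __name__ == "__main__":
    url = "https://www.linkedin.com/search/results/companies/?geoUrn=102713980"
    print(extract_geo_urn(url))
```

The returned string can be passed directly to `--geo`.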

---

## 3  Insights – embeddings, org‑charts, decision makers

```bash
python c4ai_insights.py \
--in ./data \
--out ./data \
--embed_model all-MiniLM-L6-v2 \
--top_k 10 \
--openai_model gpt-4.1 \
--max_llm_tokens 8024 \
--llm_temperature 1.0 \
--workers 4
```
Emits next to the Stage‑1 files:
* `company_graph.json` – inter‑company similarity graph
* `org_chart_<handle>.json` – one per company
* `decision_makers.csv` – LLM‑scored ‘who to pitch’ list
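
To make "inter‑company similarity graph" concrete, here is a minimal sketch of one common way to build such a graph: cosine similarity over embedding vectors, keeping the `top_k` nearest neighbours per company. It illustrates the idea only. The actual logic in `c4ai_insights.py` and the schema of `company_graph.json` are not shown in this guide, and the toy vectors stand in for Sentence‑Transformer embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbours(embeddings, k):
    """Map each name to the k most similar other names, most similar first."""
    graph = {}
    for name, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != name  # never link a company to itself
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        graph[name] = [other for _, other in scored[:k]]
    return graph


if __name__ == "__main__":
    # Hypothetical company names and 2-D vectors, for illustration only.
    toy = {"acme": [1.0, 0.0], "globex": [0.9, 0.1], "initech": [0.0, 1.0]}
    print(top_k_neighbours(toy, k=1))
```

A larger `k` produces a denser graph, which is the trade-off the `--top_k` flag controls.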

Flags reference (straight from `build_arg_parser()`):
| Flag | Default | Purpose |
|------|---------|---------|
| `--in` | `.` | Stage‑1 output dir |
| `--out` | `.` | Destination dir |
| `--embed_model` | `all-MiniLM-L6-v2` | Sentence‑Transformer model |
| `--top_k` | `10` | Neighbours per company in graph |
| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
| `--max_llm_tokens` | `8024` | Token budget per LLM call |
| `--llm_temperature` | `1.0` | Sampling temperature (higher = more varied output) |
| `--stub` | off | Skip OpenAI and fabricate tiny charts |
| `--workers` | `4` | Parallel LLM workers |
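
For readers who want to see how the table maps onto `argparse`, the sketch below reconstructs a parser from the flags and defaults listed above. It is not the script's actual `build_arg_parser()`; the function name `build_arg_parser_sketch` and the `dest` names `in_dir` / `out_dir` are assumptions made for illustration.

```python
import argparse


def build_arg_parser_sketch():
    """Parser reconstructed from the flags table; the real one may differ."""
    parser = argparse.ArgumentParser(description="Stage 2 flags (reconstructed)")
    # "in" is a Python keyword, so the value is stored under another attribute name.
    parser.add_argument("--in", dest="in_dir", default=".", help="Stage-1 output dir")
    parser.add_argument("--out", dest="out_dir", default=".", help="Destination dir")
    parser.add_argument("--embed_model", default="all-MiniLM-L6-v2")
    parser.add_argument("--top_k", type=int, default=10)
    parser.add_argument("--openai_model", default="gpt-4.1")
    parser.add_argument("--max_llm_tokens", type=int, default=8024)
    parser.add_argument("--llm_temperature", type=float, default=1.0)
    parser.add_argument("--stub", action="store_true", help="Skip the LLM call")
    parser.add_argument("--workers", type=int, default=4)
    return parser


if __name__ == "__main__":
    args = build_arg_parser_sketch().parse_args(["--in", "./data", "--stub"])
    print(args.in_dir, args.top_k, args.stub)
```

Note that with these defaults, omitting `--in` and `--out` reads from and writes to the current directory rather than `./data`.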

---

## 4  Visualise – interactive graph

After Stage 2 completes, simply open the HTML viewer from the project root:
```bash
open graph_view_template.html # macOS; or serve via Live Server / python -m http.server
```
The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.

* Left pane → list of companies (clans).
* Click a node to load its org‑chart on the right.
* Chat drawer lets you ask follow‑up questions; context is pulled from `people.jsonl`.

---

## 5  Common snags

| Symptom | Fix |
|---------|-----|
| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
| 429 Too Many Requests | Lower `--concurrency`, rotate profile, add delay |
| Blank graph | Check JSON paths, clear `localStorage` in browser |
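
For the "Blank graph" case, a quick check of the JSON inputs can save some guesswork. The sketch below assumes the layout described in section 4 (a `data/` folder beside the HTML file containing `company_graph.json` and `org_chart_*.json`); `check_graph_inputs` is a hypothetical helper, not part of the app.

```python
import json
from pathlib import Path


def check_graph_inputs(root="."):
    """Return a list of problems found; an empty list means every check passed."""
    data_dir = Path(root) / "data"
    problems = []
    graph_file = data_dir / "company_graph.json"
    if not graph_file.is_file():
        problems.append(f"missing {graph_file}")
    if not list(data_dir.glob("org_chart_*.json")):
        problems.append(f"no org_chart_*.json files in {data_dir}")
    # Every JSON file the viewer might fetch should at least parse.
    for path in sorted(data_dir.glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc})")
    return problems


if __name__ == "__main__":
    issues = check_graph_inputs(".")
    print("\n".join(issues) if issues else "data/ looks complete")
```

Run it from the project root, next to `graph_view_template.html`.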

---

### TL;DR
`crwl profiler` → `c4ai_discover.py` → `c4ai_insights.py` → open `graph_view_template.html`.
Live long and `import crawl4ai`.
