# Google Jobs Scraper
apify · JOBS, LEAD_GENERATION, AUTOMATION · by org

| Metric | Value |
|---|---|
| Original Uptime | N/A |
| Original p95 | N/A |
| Best p95 | 2.5ms |
| Subscribers | 1042 |
| Research Kept | 0/50 |
## Benchmark History
| Type | p50 | p95 | p99 | Correctness | Errors | Result | Date |
|---|---|---|---|---|---|---|---|
| post_dev | 1.2ms | 2.5ms | 2.5ms | 0.0% | 0.00% | pass | 2026-04-06T10:52:23.393203+00:00 |
| post_dev | 0.0ms | 0.0ms | 0.0ms | 0.0% | 100.00% | fail | 2026-04-06T01:23:16.831520+00:00 |
| post_dev | 0.0ms | 0.0ms | 0.0ms | 0.0% | 100.00% | fail | 2026-04-06T00:49:39.119253+00:00 |
| post_dev | 0.0ms | 0.0ms | 0.0ms | 0.0% | 100.00% | fail | 2026-04-06T00:04:05.402535+00:00 |
| post_dev | 0.0ms | 0.0ms | 0.0ms | 0.0% | 100.00% | fail | 2026-04-05T23:27:55.034314+00:00 |
| post_dev | 14.3ms | 15.7ms | 15.7ms | 0.0% | 99.40% | fail | 2026-04-05T15:29:06.040833+00:00 |
| post_dev | 14.3ms | 15.0ms | 15.0ms | 0.0% | 99.40% | fail | 2026-04-05T15:14:32.470350+00:00 |
| post_dev | 11.7ms | 14.4ms | 14.4ms | 0.0% | 99.40% | fail | 2026-04-05T13:34:11.371406+00:00 |
| post_dev | 8.5ms | 13.5ms | 13.5ms | 0.0% | 99.30% | fail | 2026-04-05T12:34:34.026015+00:00 |
## Research Iterations
#49
## Hypothesis
The `_extract_text` method iterates through CSS selectors calling `element.select_one()` for each one, which performs a full subtree search every time. By short-circuiting immediately on the first non-empty result and pre-compiling the selector strings into a single compound CSS selector using `,` (comma union), we can reduce the number of parse tree traversals from N sequential searches to a single combined search per field.
## Expected Impact
Each job card currently triggers 6
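A minimal sketch of the comma-union idea (the selector strings and field names below are illustrative stand-ins, not the actor's real ones):

```python
from bs4 import BeautifulSoup

# Hypothetical per-field selector list, as described in the hypothesis.
TITLE_SELECTORS = ["div.BjJfJf", "h2.title", "span.job-title"]

# Join with "," so a single traversal can match any of the alternatives.
TITLE_SELECTOR_UNION = ", ".join(TITLE_SELECTORS)

def extract_text(element, union_selector):
    """Return the first non-empty text matched by the combined selector."""
    for match in element.select(union_selector):
        text = match.get_text(strip=True)
        if text:
            return text
    return None

card = BeautifulSoup('<div><span class="job-title">Backend Engineer</span></div>',
                     "html.parser")
print(extract_text(card, TITLE_SELECTOR_UNION))  # prints "Backend Engineer"
```

One caveat: a comma union returns matches in document order, so it loses the explicit priority ordering of the original selector list, which can change which element wins when several selectors match the same card.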
reverted
#48
## Hypothesis
The `_parse_jobs` method calls `soup.select()` sequentially with four different CSS selectors until one returns results, and then `_extract_job_from_card` calls `_extract_text` which itself calls `element.select_one()` through multiple selectors for each field on every card. By pre-compiling these CSS selectors into `SoupSieve` compiled objects (via `soupsieve.compile()`) at class initialization time rather than re-parsing the selector strings on every call, we eliminate repeated
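The pre-compilation this hypothesis describes can be sketched with SoupSieve's public API (the selector string is a stand-in):

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Compiled once at import/init time; the selector string is illustrative.
TITLE_PATTERN = sv.compile("div.BjJfJf, span.job-title")

def extract_title(card):
    # select_one on a compiled pattern skips re-parsing the selector string.
    match = TITLE_PATTERN.select_one(card)
    return match.get_text(strip=True) if match else None

soup = BeautifulSoup('<li><span class="job-title">Data Analyst</span></li>',
                     "html.parser")
print(extract_title(soup))  # prints "Data Analyst"
```

Worth noting: BeautifulSoup already routes `select()`/`select_one()` through SoupSieve, which keeps an internal cache of compiled patterns, so the saving here is mostly a cache lookup; that would be consistent with the change not moving the benchmark.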
reverted
#47
## Hypothesis
The `response.model_dump()` call in `routes/jobs.py` serializes the entire `JobSearchResponse` (including all job listings) to a dict for caching, and then FastAPI immediately re-serializes the same object back to JSON for the response. By caching the already-serialized dict *before* constructing the `JobSearchResponse` and returning the cached dict directly via `JSONResponse`, we avoid one redundant full-object serialization on cache misses and eliminate double-serialization over
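A framework-agnostic sketch of the single-serialization idea (in the real route this would return the pre-serialized body via FastAPI's `Response`/`JSONResponse`; the cache and payload shown here are illustrative):

```python
import json

_cache = {}  # query -> already-serialized JSON string

def search_jobs(query):
    """Return the response body as a JSON string, serializing at most once."""
    if query in _cache:           # cache hit: no serialization at all
        return _cache[query]
    jobs = [{"title": "Engineer", "company": "Acme"}]   # stand-in for scraped data
    body = json.dumps({"query": query, "jobs": jobs})   # serialize exactly once
    _cache[query] = body          # store the serialized string, not the model
    return body                   # return it directly; never re-serialize
```

The key move is caching *after* serialization and returning the cached string as-is, so neither the miss path nor the hit path serializes the payload twice.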
reverted
#46
## Hypothesis
The `BeautifulSoup(html, "lxml")` parser is instantiated inside `_parse_jobs` on every page fetch, and the multiple `soup.select()` calls with complex CSS selectors are evaluated sequentially — but the real bottleneck is that `lxml` must parse the entire HTML document before any selector runs. By pre-filtering the HTML to only the relevant job-container region using a fast string search before passing to BeautifulSoup, we reduce the DOM size that lxml must build, directly cutting
reverted
#45
## Hypothesis
The `GoogleJobsScraper` creates a new `httpx.AsyncClient` on every `search_jobs` call (inside `async with httpx.AsyncClient(...)`), which incurs TCP connection setup overhead on each request. By maintaining a persistent `AsyncClient` with connection pooling at the class level (initialized once and reused across requests), we eliminate repeated TCP handshake and TLS negotiation costs, reducing p95 latency especially under concurrent load.
Note: iteration 40 tried this and was reve
reverted
#44
## Hypothesis
The `_normalize_date` method calls `datetime.utcnow()` on every invocation and re-evaluates multiple regex patterns sequentially. By pre-compiling all regex patterns as module-level constants and caching a single `datetime.utcnow()` call, we reduce repeated object creation overhead during bulk card parsing.
**Expected impact:** When parsing many job cards per page, `_normalize_date` is called once per card. Pre-compiled regex patterns avoid repeated `re.search` pattern compilatio
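A sketch of the pre-compiled variant (the pattern strings and the function shape are assumptions; the actor's actual `_normalize_date` is not shown in this history):

```python
import re
from datetime import datetime, timedelta, timezone

# Compiled once at import time instead of inside every call.
_DAYS_AGO_RE = re.compile(r"(\d+)\s*days?\s*ago")
_HOURS_AGO_RE = re.compile(r"(\d+)\s*hours?\s*ago")

def normalize_date(raw, now=None):
    """Turn a relative string like '3 days ago' into an ISO date (sketch)."""
    now = now or datetime.now(timezone.utc)
    m = _DAYS_AGO_RE.search(raw)
    if m:
        return (now - timedelta(days=int(m.group(1)))).date().isoformat()
    m = _HOURS_AGO_RE.search(raw)
    if m:
        return (now - timedelta(hours=int(m.group(1)))).date().isoformat()
    return None
```

Note that `re.search` with a literal pattern already hits the `re` module's internal compiled-pattern cache after the first call, so the measurable saving is only the cache lookup; that would explain the marginal effect and the revert.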
reverted
#43
## Hypothesis
The `GoogleJobsScraper` instance is created once at module load time in `routes/jobs.py`, but a new `httpx.AsyncClient` is created and torn down on **every** `search_jobs` call (including SSL handshakes and connection establishment). Reusing a persistent `AsyncClient` with a connection pool across all requests will eliminate repeated TCP/TLS setup overhead, directly reducing p95 latency for cache-miss requests.
### Expected Impact
Each `httpx.AsyncClient` creation involves initi
reverted
#42
## Hypothesis
The `_extract_text` method calls `element.select_one()` sequentially for every selector in each list, but many of these selectors are class-name lookups that could be short-circuited by using `element.find()` with direct class name matching, which is faster in BeautifulSoup than CSS selector parsing. However, looking at the iteration history, most micro-optimizations have been reverted — the most impactful untried change is to **enable HTTP/2** on the `httpx.AsyncClient`, which al
reverted
#41
## Hypothesis
The `_normalize_date` method compiles regex patterns from scratch on every invocation via `re.search(r"(\d+)\s*(day|days)", ...)` etc. Pre-compiling these patterns as module-level constants eliminates repeated regex compilation overhead, which accumulates significantly when parsing many job cards across multiple pages.
**Expected impact:** Each call to `_normalize_date` currently triggers 2–3 `re.search` calls with raw pattern strings, causing Python's `re` module to look up (or
reverted
#40
## Hypothesis
The `httpx.AsyncClient` is recreated on every `search_jobs` call, incurring TCP connection setup overhead (including TLS handshake to Google's servers) on each request. By reusing a persistent `AsyncClient` instance with a connection pool across requests, we eliminate the per-call connection establishment cost, which is especially impactful for p95 latency where cold connections dominate.
**Expected impact:** Google's servers use HTTPS, so each new client incurs a TLS handshake (
reverted
#39
## Hypothesis
The `_parse_fallback` method imports `json` inside a loop on every call, incurring repeated module lookup overhead. Moving the `import json` to the module level eliminates this per-iteration cost and reduces CPU overhead during HTML parsing, which contributes to p95 latency when cache misses occur.
**Expected impact:** Minor but consistent reduction in parsing latency, particularly when multiple JSON-LD script tags are present, as the module lookup (`sys.modules` dict access + lo
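The hoisted-import version can be sketched as follows (the function body is illustrative; the real `_parse_fallback` is not shown in this history):

```python
import json  # hoisted to module level; previously re-imported per loop iteration
from bs4 import BeautifulSoup

def parse_fallback(html):
    """Pull job titles out of JSON-LD script tags (sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "JobPosting":
            jobs.append(data.get("title"))
    return jobs

sample = '<script type="application/ld+json">{"@type": "JobPosting", "title": "SRE"}</script>'
print(parse_fallback(sample))  # prints "['SRE']"
```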
reverted
#38
## Hypothesis
The `_extract_text` method iterates through multiple CSS selectors sequentially using `select_one`, but the majority of job cards will match the **first** selector in each list (the primary Google Jobs class). By reordering the selector lists to put the statistically most-likely match first and adding an **early-exit compiled CSS selector** using a single combined `:is()` pseudo-class query where possible, we avoid redundant DOM traversals per field per card.
However, looking mor
reverted
#37
## Hypothesis
The `asyncio.sleep(0.8)` polite delay between pages is unconditional and executes synchronously in the request path, adding 800ms × (pagesToFetch - 1) directly to p95 latency. For the common case of `pagesToFetch=1` this sleep is skipped, but for multi-page requests this is pure dead time; removing or drastically reducing it (since we're already doing async I/O and the HTTP fetch itself provides natural spacing) will cut p95 proportionally.
**Expected impact:** For any request wi
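A conservative variant of this change, sketched with an injectable sleep so the between-pages behavior is testable (the delay constant is illustrative):

```python
import asyncio

POLITE_DELAY_SECONDS = 0.2  # reduced from the 0.8s quoted in the hypothesis

async def fetch_pages(fetch_one, pages_to_fetch, sleep=asyncio.sleep):
    """Fetch N pages, sleeping only *between* pages, never after the last."""
    results = []
    for page in range(pages_to_fetch):
        results.append(await fetch_one(page))
        if page < pages_to_fetch - 1:   # no dead time after the final page
            await sleep(POLITE_DELAY_SECONDS)
    return results
```

This keeps some politeness toward the upstream site while removing the delay entirely from the common single-page path.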
reverted
#36
## Hypothesis
The `_parse_jobs` method calls `soup.select()` sequentially with four different CSS selectors using Python-level iteration and fallback logic, but the real bottleneck is that `BeautifulSoup(html, "lxml")` is called synchronously in the async request handler — blocking the event loop during HTML parsing of potentially large Google search result pages. Offloading the CPU-bound parsing to a thread pool executor via `asyncio.get_event_loop().run_in_executor()` will prevent event loop
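The executor offload this hypothesis proposes, in sketch form (using `asyncio.get_running_loop()`, the modern spelling of the loop lookup; the selector and parser choice are illustrative):

```python
import asyncio
from bs4 import BeautifulSoup

def parse_jobs_sync(html):
    """CPU-bound parse; runs in a worker thread, not on the event loop."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("div.job-card")]

async def parse_jobs(html):
    # Offload to the default thread pool so large documents do not block
    # other coroutines while lxml/html.parser builds the tree.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, parse_jobs_sync, html)

print(asyncio.run(parse_jobs('<div class="job-card">DevOps Engineer</div>')))
```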
reverted
#35
## Hypothesis
The `BeautifulSoup` HTML parser is initialized with `"lxml"` on every page fetch, but the lxml parser incurs significant per-call overhead for large Google HTML responses. Switching to `"lxml-xml"` is inappropriate here, but pre-compiling the CSS selectors or using lxml's native `etree` directly would be complex — instead, the quick win is to use `html.parser` (Python's built-in) which avoids lxml's C-extension initialization overhead per parse call, reducing per-page parsing late
reverted
#34
## Hypothesis
The `httpx.AsyncClient` is recreated on every `search_jobs` call (inside an `async with` block), incurring TCP connection setup overhead and TLS handshake cost on every request. By persisting a single `httpx.AsyncClient` instance at the scraper level (initialized once and reused across calls), we eliminate repeated connection establishment, reducing p95 latency particularly for multi-page fetches and concurrent requests.
## Expected Impact
Each `httpx.AsyncClient` creation invol
reverted
#33
## Hypothesis
The `_parse_jobs` method calls `soup.select()` with four different CSS selectors **sequentially** using short-circuit `or`, meaning each failed selector traverses the entire parse tree before trying the next. Pre-compiling these selectors into a single combined CSS selector string (e.g., `"div.iFjolb, div[data-ved] div.pE8vnd, li.LL4CDc, div.gws-plugins-horizon-jobs__tl-lif"`) would let lxml execute a single unified tree traversal instead of up to four separate ones, reducing pars
reverted
#32
## Hypothesis
The `_normalize_date` method calls `datetime.utcnow()` on every invocation and compiles regex patterns (`re.search`) fresh each call. Pre-compiling the regex patterns as module-level constants eliminates repeated regex compilation overhead, which accumulates across many job cards parsed per request.
**Expected impact:** Each parsed job card calls `_normalize_date` once, meaning N jobs × repeated regex compilation. Pre-compiled patterns avoid the compilation step on every call, re
reverted
#31
## Hypothesis
The `GoogleJobsScraper` is instantiated as a module-level singleton but creates a new `httpx.AsyncClient` on every `search_jobs` call (inside an `async with` block), incurring TCP connection setup overhead on each request. Persisting a single `httpx.AsyncClient` instance with connection pooling at the class level will eliminate repeated TCP handshakes and reduce p95 latency for cache-miss requests.
## Expected Impact
Each `search_jobs` call currently pays TCP connection establis
reverted
#30
## Hypothesis
The `_normalize_date` method uses a sequential list of `datetime.strptime` calls inside a try/except loop as a last-resort parse, and the regex patterns are recompiled on every invocation. Pre-compiling these regex patterns as module-level constants will eliminate repeated regex compilation overhead across all parsed job cards.
**Expected impact:** Each call to `_normalize_date` currently recompiles up to 3 regex patterns. With dozens of job cards per page and multiple pages per
reverted
#29
## Hypothesis
The `_extract_text` method calls `element.select_one(selector)` sequentially for each selector in a list, and these CSS selector strings are re-parsed by BeautifulSoup's CSS selector engine on every single invocation. Pre-compiling these selector lists into `SoupSieve` pattern objects at class initialization time (using `soupsieve.compile`) will eliminate repeated selector parsing overhead, which accumulates significantly across the many fields extracted per job card.
**Expected
reverted
#28
## Hypothesis
The `_parse_fallback` method imports `json` inside a loop on every invocation, causing repeated module lookup overhead that accumulates across multiple script tags per page. Moving the `import json` to the module level eliminates this redundant work.
**Expected impact:** While `import json` is cached by Python's module system after the first import, the attribute lookup through `sys.modules` still occurs on each call inside the loop. For pages with multiple `<script type="applica
reverted
#27
## Hypothesis
The `httpx.AsyncClient` is recreated on every `search_jobs` call, incurring TCP connection establishment overhead for each request. Persisting a single `AsyncClient` instance at the class level (initialized once in `__init__`) will reuse connections via HTTP keep-alive, eliminating the per-call TLS/TCP handshake latency that dominates p95 for cache-miss paths.
### Expected Impact
Each Google search request currently pays a full connection establishment cost (DNS lookup + TCP han
reverted
#26
## Hypothesis
The `_build_url` method uses `urllib.parse.urlencode` which is correct but the repeated string concatenation for the `chips` parameter (checking `"chips" in params` and appending) involves redundant dict lookups and string construction on every call. More impactfully, the `_build_query` method and `_build_url` method are called once per page in a loop, but the query string and domain lookup are recomputed identically each iteration — hoisting these out of the loop eliminates redun
reverted
#25
## Hypothesis
The `BeautifulSoup(html, "lxml")` object is constructed once per page, but `_extract_text` calls `element.select_one(selector)` with CSS selector strings that are re-parsed by cssselect on every invocation. Pre-compiling the selector lists into `SoupSieve` objects via `soupsieve.compile()` at class initialization time would eliminate repeated selector parsing overhead across all cards and all fields.
## Expected Impact
For a page with 10 job cards and ~6 fields each, `_extract_t
reverted
#24
## Hypothesis
The `_parse_jobs` method calls `soup.select()` sequentially with up to four CSS selectors, each performing a full DOM traversal even after a match is found. Pre-compiling these selectors into a single combined CSS selector string (using comma-separated selectors) allows BeautifulSoup/lxml to find all matching cards in a single DOM pass, reducing redundant traversal overhead on large HTML responses.
**Expected impact:** For pages with many DOM nodes, eliminating 2–3 redundant full
reverted
#23
## Hypothesis
The `GoogleJobsScraper` is instantiated as a module-level singleton but creates a new `httpx.AsyncClient` (with full TLS handshake and connection setup overhead) on **every `search_jobs` call** inside a `async with` block. Replacing this with a persistent, connection-pooling `AsyncClient` initialized once at startup (and reused across all requests) will eliminate repeated TCP/TLS setup costs, directly reducing p95 latency for cache-miss paths.
### Expected Impact
Each `httpx.Asy
reverted
#22
## Hypothesis
The `_clean_posted_via` method calls `re.sub` with a compiled-on-every-call pattern to strip the "via" prefix. Pre-compiling this regex as a module-level constant eliminates repeated pattern compilation overhead, and more importantly, the `_extract_text` method iterates through all selectors even after finding a match due to relying solely on `select_one` — but the bigger win is that **BeautifulSoup's `select_one` with complex CSS selectors (especially `[class*='...']` substring a
reverted
#21
## Hypothesis
The `_normalize_date` method calls `datetime.utcnow()` on every invocation and compiles regex patterns via `re.search` repeatedly at runtime. Pre-compiling the regex patterns as module-level constants eliminates repeated regex compilation overhead across all card parsing calls, reducing CPU time spent in `_parse_jobs` especially when processing many job cards per page.
**Expected impact:** Each call to `_normalize_date` currently triggers regex compilation on every `re.search` ca
reverted
#20
## Hypothesis
The `_parse_jobs` method tries up to four CSS selectors sequentially using `or` short-circuiting, but each `soup.select(...)` call still fully traverses the DOM even for the common case where the first selector succeeds. Pre-compiling these selectors into `SoupSieve` objects (via `soupsieve.compile`) at class initialization time would eliminate repeated selector parsing overhead on every page parse call.
However, looking at the iteration history more carefully, many low-hanging f
reverted
#19
## Hypothesis
The `_normalize_date` method uses a sequential series of `re.search` calls with pattern strings that are recompiled on every invocation. Pre-compiling these regex patterns as module-level constants will eliminate repeated regex compilation overhead, which is called once per job card parsed across all pages.
**Expected impact:** For responses with many job listings (e.g., 10–30 cards × multiple pages), this saves repeated `re.compile` calls inside the hot parsing loop, reducing CP
reverted
#18
## Hypothesis
The `_extract_text` method calls `element.select_one(selector)` sequentially for each selector in a list, compiling CSS selector strings on every invocation. By pre-compiling the most common selectors using `SoupSieve` (via `soupsieve.compile`) at class initialization time and caching the compiled selector objects, repeated parsing of many job cards avoids redundant selector parsing overhead.
However, looking more carefully at what hasn't been tried yet: the `_parse_jobs` method
reverted
#17
## Hypothesis
The `httpx.AsyncClient` is recreated on every `search_jobs` call, incurring TCP connection establishment overhead on each request. Persisting a single `AsyncClient` instance on the `GoogleJobsScraper` object (initialized at startup) will reuse existing connections via HTTP keep-alive, eliminating the per-request TCP handshake and SSL negotiation costs that dominate p95 latency for the common single-page fetch case.
### Expected Impact
Each Google search request currently pays fu
reverted
#16
## Hypothesis
The `GoogleJobsScraper` instance is created as a module-level singleton in `routes/jobs.py`, but a new `httpx.AsyncClient` is constructed and torn down on **every** `search_jobs` call (inside `async with httpx.AsyncClient(...)`). This means TCP connection establishment, TLS handshake, and HTTP/2 negotiation overhead are paid on every request. Persisting a single `AsyncClient` across requests (with connection pooling) will reuse existing TCP/TLS connections to Google, eliminating t
reverted
#15
## Hypothesis
The `_parse_fallback` method imports `json` inside a loop on every iteration (within `script_tags` loop), incurring repeated module lookup overhead. Moving the `import json` to the module level eliminates this repeated lookup cost, which accumulates when multiple `<script type="application/ld+json">` tags are present in the HTML.
**Expected impact:** Minor but consistent reduction in CPU time during fallback parsing. Each `import json` inside a loop triggers a `sys.modules` dicti
reverted
#14
## Hypothesis
The BeautifulSoup HTML parser is initialized with the string `"lxml"` on every `_parse_jobs` call, but the dominant cost is that `BeautifulSoup(html, "lxml")` parses the **entire** Google search HTML document — which can be hundreds of kilobytes — before any job card selection begins. By switching to `"html.parser"` (Python's built-in), we avoid the overhead of the lxml C-extension's full-document tree construction and the associated memory allocation, which for large HTML payload
reverted
#13
## Hypothesis
The `httpx.AsyncClient` is created fresh on every `search_jobs` call (inside `async with httpx.AsyncClient(...)`), which incurs TCP connection setup overhead on every request. By creating a persistent `AsyncClient` instance at scraper initialization time (reusing the connection pool across requests), we eliminate repeated TLS handshakes and connection establishment for the Google domains, directly reducing p95 latency for cache-miss requests.
### Expected Impact
- Each `search_j
reverted
#12
## Hypothesis
The `GoogleJobsScraper` instance is created once at module load time in `routes/jobs.py`, but a new `httpx.AsyncClient` is created (and torn down) inside `search_jobs` on every single request. Creating an `AsyncClient` involves allocating connection pools, SSL contexts, and other resources — sharing a single persistent client across requests eliminates this per-request overhead and enables HTTP connection reuse (keep-alive), which is the dominant source of latency outside of Googl
reverted
#11
## Hypothesis
The `_build_url` method reconstructs the `chips` parameter string by conditionally checking and concatenating string fragments on every call, and `urllib.parse.urlencode` is called with a plain dict that doesn't preserve insertion order for the `chips` key logic. More impactfully, the `GoogleJobsScraper` instance is created at module import time in `routes/jobs.py`, but the `httpx.AsyncClient` is still created fresh inside `search_jobs` on every request — the client should be crea
reverted
#10
## Hypothesis
The `_clean_posted_via` method calls `re.sub` with a pattern string on every invocation, causing repeated regex compilation overhead. Pre-compiling the pattern as a module-level constant will eliminate this repeated compilation cost.
**Expected impact:** Minor but consistent reduction in per-job parsing overhead, especially on pages with many job cards. This is a purely mechanical change with zero correctness risk.
```python
# At module level in services/scraper.py, add:
_VIA_PR
```
reverted
#9
## Hypothesis
The `_normalize_date` method calls `datetime.utcnow()` on every invocation and performs sequential regex searches with uncompiled patterns. Pre-compiling the regex patterns as class-level constants eliminates repeated compilation overhead, and caching `datetime.utcnow()` at the start of `search_jobs` (passed down or stored briefly) reduces repeated syscalls when parsing many job cards.
However, the more impactful change is: **the `httpx.AsyncClient` is reconstructed per request**
reverted
#8
## Hypothesis
The `_parse_jobs` method tries four CSS selector strategies sequentially using `or` short-circuit evaluation, but each `soup.select(...)` call performs a full DOM traversal even when a previous selector already found results. More critically, the selector list in `_extract_text` is iterated linearly on every field for every card — pre-compiling these into a single combined CSS selector (e.g., `"div.BjJfJf, span.BjJfJf, h2.BjJfJf, ..."`) would reduce the number of DOM traversal cal
reverted
#7
## Hypothesis
The `_extract_text` method calls `element.select_one(selector)` sequentially through multiple CSS selectors on every card field extraction, compiling each CSS selector string into a parsed object on every call. Pre-compiling these CSS selectors once at class initialization time (using `Soup.cssselect` or caching via `soupsieve.compile`) will eliminate repeated selector parsing overhead across all cards and all fields per request.
## Explanation
For each job card, `_extract_text`
reverted
#6
## Hypothesis
The `_normalize_date` method compiles regex patterns on every invocation using `re.search`, which adds repeated overhead for every job card parsed. Pre-compiling these patterns as module-level constants eliminates repeated regex compilation and reduces CPU overhead in the hot parsing path.
**Expected impact:** Each job card parse triggers 3+ `re.search` calls with string-literal patterns that Python must compile (and cache via `re`'s internal LRU, but with lock contention under a
reverted
#5
## Hypothesis
The `httpx.AsyncClient` is created fresh for every `search_jobs` call (inside the `async with` block), incurring TCP connection establishment overhead on every request. By creating a persistent, connection-pooling `AsyncClient` at scraper initialization time and reusing it across calls, we eliminate repeated handshake latency, especially significant at p95 where connection setup variance dominates.
**Expected impact:** Saves 50–200ms per request by reusing keep-alive TCP connecti
reverted
#4
## Hypothesis
The `json` module is imported inside the loop in `_parse_fallback` on every call (within each `script_tags` iteration), and `import json` at the top of `scraper.py` is missing — Python's import system handles this cheaply after the first import, but the real overhead is that BeautifulSoup's `soup.find_all("script", type="application/ld+json")` and then `soup.select(...)` in `_parse_jobs` run multiple full-tree traversals. The bigger structural issue is that the `GoogleJobsScraper`
reverted
#3
## Hypothesis
The `BeautifulSoup(html, "lxml")` parser is initialized on every `_parse_jobs` call without any reuse, and the multiple sequential `card.select_one()` calls inside `_extract_job_from_card` each traverse the parsed tree independently. Replacing the redundant multi-selector loops with a single pre-compiled CSS selector pass (using `SoupStrainer` to limit parsing scope to job card containers only) will reduce HTML parsing and DOM traversal time, which is the dominant CPU cost on cach
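A sketch of the `SoupStrainer` idea, assuming the `iFjolb` container class quoted elsewhere in this history; note that `SoupStrainer` works with `html.parser` and lxml but is ignored by html5lib:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Parse only the job-card containers; all other nodes are discarded at
# parse time, shrinking the tree that must be built and traversed.
ONLY_CARDS = SoupStrainer("div", class_="iFjolb")

def parse_cards(html):
    soup = BeautifulSoup(html, "html.parser", parse_only=ONLY_CARDS)
    return [card.get_text(strip=True)
            for card in soup.find_all("div", class_="iFjolb")]

page = '<body><nav>menu</nav><div class="iFjolb">ML Engineer</div></body>'
print(parse_cards(page))  # prints "['ML Engineer']"
```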
reverted
#2
## Hypothesis
The `httpx.AsyncClient` is recreated on every request inside `search_jobs`, incurring TCP connection setup overhead (and TLS handshake if HTTPS) on every scrape call. Persisting a single shared `AsyncClient` instance at the scraper level (initialized once at startup) will reuse connections via HTTP keep-alive, eliminating repeated connection establishment costs that dominate p95 latency for cache-miss requests.
## Expected Impact
- **TCP/TLS connection setup** to Google (~50-150
reverted
#1
## Hypothesis
The `httpx.AsyncClient` is recreated on every request inside `search_jobs`, incurring TCP handshake and TLS negotiation overhead on each call. Replacing the per-request client with a single long-lived `httpx.AsyncClient` instantiated once at scraper construction time (and closed during app shutdown) will eliminate connection setup latency, enabling connection reuse across requests.
### Expected Impact
Google's servers support keep-alive connections. Currently every call to `sear
reverted
#0
## Hypothesis
The `GoogleJobsScraper` instance is created once at module level but creates a **new `httpx.AsyncClient`** for every request (inside `search_jobs`), incurring TCP connection setup overhead on every call. By making the client persistent (created once at startup and reused across requests with connection pooling), we eliminate repeated TCP handshakes and TLS negotiation, which are the dominant latency contributors for cache-miss requests.
## Expected Impact
For cache-miss requests
reverted
## Deployment

| Field | Value |
|---|---|
| Commit | f0b34bfca5a4 |
| Deployed | 2026-04-09T01:57:26.975552+00:00 |
| RapidAPI Listing | google-jobs-scraper |
| Apify Actor | 9NucHd2rvrHavMaOf |