# Entity resolution

#### Problem: Account identity mismatches

All systems of record suffer from master entity data that is inconsistent, incorrect, or stale. Any of these can create ambiguity when matching to a canonical entity database like Kernel's.

Traditional vendors rely on simple lookup fields matched against a static database. These lookup fields are usually the company's website or name.

<table><thead><tr><th width="118.4375">Name</th><th width="128.7578125">Website</th><th>Ambiguity</th></tr></thead><tbody><tr><td>Pepsi</td><td>fritolay.com</td><td>Which company is this, Pepsi or Frito-Lay? Frito-Lay is a subsidiary of Pepsi but operates on its own.</td></tr><tr><td>Dove</td><td>unilever.com</td><td>Dove is a brand within Unilever, not a legal entity on its own. But some GTM teams may still want this to resolve to Dove.</td></tr><tr><td>Oracle</td><td></td><td>This is likely the Oracle Corporation, but it may also refer to a Vancouver-based investment company that is also called Oracle, or to any of the Oracle Corporation's subsidiaries.</td></tr><tr><td>Kernel</td><td></td><td>This could refer either to Kernel, the UK-based startup building an entity database of all companies in the world, or to the US-based neuroscience startup founded by Bryan Johnson.</td></tr></tbody></table>

As a human, to resolve this you would look at peripheral data in the system:

* Are there alternative names and websites associated with the record? For example, if the Kernel record has 'kernel.ai' written somewhere, this helps resolve the ambiguity.
* Is there an address? Pepsi is headquartered in New York, whereas Frito-Lay is in Texas.
* What contacts are associated with the record? If all contact emails end in "oracle.com", this is likely the Oracle Corporation. But if all contacts are located in Spain, it is more likely to be Oracle Ibérica, S.R.L., the Spanish regional subsidiary of Oracle.
* What notes have previous users left behind? Previous opportunity and deal names may show the user's intent: "Dove | Pilot - Closed Lost" indicates that the rep thought of the record as a business unit rather than the legal entity itself.

#### Reasoning-based entity resolution

A customer record may have a mix of information in its core fields, describing it as both PepsiCo and Frito-Lay. This means that a simple name/website lookup does not guarantee a match to the KERN ID that the customer intended.

To resolve this, Kernel's entity resolution reviews internal and external data related to the account, from the name and website to the address, alternative domains, existing parent relationships, related contact domains, associated opportunity names, and more.

Kernel uses the following data (where available) to make its recommendation:

* Company name
* Company website
* Alternative websites and company names
* Legal name
* Billing/Shipping address
* LinkedIn Company URL
* Account/billing emails
* Account notes
* Parent ID
* Related opportunity names
* Related contact domains
* Content crawled from the company website (as listed)

Kernel uses all of the data points listed above on the assumption that some may be incorrect or inconsistent with one another.
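
To illustrate why several imperfect signals are more robust than a single lookup field, here is a minimal sketch that tallies weighted evidence per candidate entity. It is not Kernel's actual method, which is reasoning-based; the `Evidence` class, the source labels, and the weights are all hypothetical and chosen only for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str     # hypothetical label, e.g. "website" or "contact_domain"
    candidate: str  # the entity this signal points to
    weight: float   # made-up trust value for this kind of signal


def tally(evidence: list[Evidence]) -> list[tuple[str, float]]:
    """Sum the weights per candidate and rank candidates by total support."""
    totals: Counter = Counter()
    for item in evidence:
        totals[item.candidate] += item.weight
    return totals.most_common()


# A record whose name says PepsiCo while the other fields say Frito-Lay.
record_evidence = [
    Evidence("name", "PepsiCo", 1.0),
    Evidence("website", "Frito-Lay", 2.0),
    Evidence("contact_domain", "Frito-Lay", 1.5),
    Evidence("billing_address", "Frito-Lay", 1.0),
]
ranking = tally(record_evidence)
```

In this sketch the single conflicting name field is outvoted by three signals that agree with each other, so one wrong field does not decide the match.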

#### Website analysis

A correct mapping between a company and its website is a prerequisite for identity resolution and enrichment. Kernel performs website analysis in three steps.

**Step 1: Removing invalid domains**

Kernel checks websites against a maintained list of common errors:

* Public email providers (`gmail.com`, `mail.ru`, `outlook.com`)
* Placeholder domains (`test.com`)
* Link shorteners (`bit.ly`, `linktr.ee`) — Kernel first attempts to follow these to see if they resolve to a valid corporate website

This step also distinguishes `facebook.com` (the company) from `facebook.com/user-profile` (e.g., an influencer's page).
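
The screening described in this step can be sketched as a simple classifier. This is an illustration only: the function name `screen_website`, the returned labels, and the `SOCIAL_PLATFORMS` set are assumptions, and the domain lists contain just the examples mentioned above rather than Kernel's maintained list.

```python
from urllib.parse import urlparse

# Illustrative lists only; the entries come from the examples above.
PUBLIC_EMAIL_PROVIDERS = {"gmail.com", "mail.ru", "outlook.com"}
PLACEHOLDER_DOMAINS = {"test.com"}
LINK_SHORTENERS = {"bit.ly", "linktr.ee"}
SOCIAL_PLATFORMS = {"facebook.com"}  # hypothetical category for this sketch


def screen_website(raw: str) -> str:
    """Return a coarse label for a website value taken from a CRM record."""
    value = raw.strip().lower()
    if not value:
        return "missing"
    # urlparse only fills in netloc when a scheme is present.
    parsed = urlparse(value if "://" in value else "https://" + value)
    host = parsed.netloc.removeprefix("www.")
    path = parsed.path.strip("/")
    if host in PUBLIC_EMAIL_PROVIDERS:
        return "public_email_provider"
    if host in PLACEHOLDER_DOMAINS:
        return "placeholder"
    if host in LINK_SHORTENERS:
        return "link_shortener"  # try following the link before discarding it
    if host in SOCIAL_PLATFORMS and path:
        return "profile_page"  # a page on the platform, not the platform itself
    return "candidate"
```

For example, `screen_website("facebook.com")` returns `"candidate"`, while `screen_website("facebook.com/user-profile")` returns `"profile_page"`.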

**Step 2: Inferring missing domains**

For accounts without a valid website, Kernel infers the correct one using signals from across the CRM record — contact email domains, LinkedIn profiles, address lookups, alternative website fields, and web search. It also catches typos (e.g., `delotte.com` → `deloitte.com`) and cases where the account name is actually a URL.

Kernel crawls all candidate websites, feeding their content into an AI-based algorithm to determine if the pairing is accurate.
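
Two of the inference signals above, typo correction and contact email domains, can be sketched with the Python standard library. This is a simplified illustration and not Kernel's implementation: `KNOWN_DOMAINS`, the similarity cutoff, and both function names are assumptions made for the example.

```python
from collections import Counter
from difflib import get_close_matches
from typing import Optional

# Hypothetical reference list; a real one would be far larger.
KNOWN_DOMAINS = ["deloitte.com", "oracle.com", "unilever.com"]
PUBLIC_EMAIL_PROVIDERS = {"gmail.com", "mail.ru", "outlook.com"}


def correct_typo(domain: str) -> Optional[str]:
    """Return the closest known domain, or None if nothing is similar enough."""
    # The 0.85 cutoff is an arbitrary choice for this sketch.
    matches = get_close_matches(domain.lower(), KNOWN_DOMAINS, n=1, cutoff=0.85)
    return matches[0] if matches else None


def domain_from_contacts(emails: list[str]) -> Optional[str]:
    """Return the most common non-public email domain among the contacts."""
    domains = [e.rsplit("@", 1)[1].lower() for e in emails if "@" in e]
    corporate = [d for d in domains if d not in PUBLIC_EMAIL_PROVIDERS]
    if not corporate:
        return None
    return Counter(corporate).most_common(1)[0][0]
```

Here `correct_typo("delotte.com")` returns `"deloitte.com"`, and a contact list dominated by `oracle.com` addresses yields `"oracle.com"` even if a few contacts use personal email accounts. A candidate produced this way would still need to be checked against the crawled site content.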

**Step 3: Website verification**

Kernel verifies each website to determine its true operational status — going far beyond a simple ping:

* **URL path resolution** — Traces to the final destination, following redirects and handling URL variations
* **Unrestricted global access** — Bypasses regional firewalls and restrictions to ensure a reliable connection
* **Intelligent content analysis** — Recognizes that a "200 OK" status code isn't enough. Flags domain parking pages, "out of business" messages, and domains repurposed for unrelated content
* **False negative prevention** — Cross-references unresponsive sites against a curated database of known corporate domains to prevent legitimate sites from being flagged due to temporary outages
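
The content analysis and false negative checks above can be sketched as a classifier over an already-fetched response. Everything named here is hypothetical (the `FetchResult` shape, the marker phrases, and the status labels), and the sketch deliberately leaves out the fetching, redirect handling, and regional access parts.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Hypothetical phrase lists; a real check would need far broader coverage.
PARKING_MARKERS = ("domain is for sale", "buy this domain")
CLOSED_MARKERS = ("out of business", "permanently closed")


@dataclass
class FetchResult:
    final_url: str              # URL reached after following redirects
    status_code: Optional[int]  # None when the site did not respond
    body: str                   # page text, empty if nothing was returned


def classify(result: FetchResult, known_domains: set[str]) -> str:
    """Label a fetched site using its status code and page content."""
    host = urlparse(result.final_url).netloc.lower().removeprefix("www.")
    if result.status_code is None or result.status_code >= 400:
        # Known corporate domains get the benefit of the doubt on outages.
        return "recheck_later" if host in known_domains else "not_functional"
    text = result.body.lower()
    # A successful response can still be a parked or abandoned site.
    if any(marker in text for marker in PARKING_MARKERS):
        return "parked"
    if any(marker in text for marker in CLOSED_MARKERS):
        return "out_of_business"
    return "functional"
```

Under these assumptions, a 503 from a domain in `known_domains` is labeled `"recheck_later"` rather than `"not_functional"`, while a 200 response whose body says the domain is for sale is labeled `"parked"`.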

The output of website analysis:

| Field               | Definition                                                                                            |
| ------------------- | ----------------------------------------------------------------------------------------------------- |
| **Website status**  | Whether the website is functional; 4xx/5xx codes, parking pages, and "out of business" messages are reported as non-functional |
| **Resolved domain** | The final domain after following all redirects                                                        |
| **Inferred domain** | If the original website was incorrect or missing, the corrected website                               |

Website verification is a crucial factor in determining whether the cleaning action should be "Delete."

***

#### Identity tiebreaker

When Kernel detects a conflict between a company's name and its website URL, the tiebreaker setting controls which signal takes priority. The key question: *what would your rep consider the identity to be if they only had the URL and Name?*

* **URL tiebreaker (default)** — Kernel prioritizes the company's website URL. Best when website domains are generally reliable. Ensures the live domain is trusted over variations in naming.
* **Name tiebreaker** — Kernel prioritizes the company's name and uses it to infer the correct URL. Best when names are standardized in your CRM and domains may vary.
* **Name only** — Kernel relies entirely on the company's name. Ignores other signals and uses only the inferred URL from the account name. Recommended only for very clean and consistent naming conventions.

Most teams should leave this on the default **URL tiebreaker**. Only switch to **Name tiebreaker** or **Name only** if you're certain your naming conventions are stronger than your URL data.
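
As a rough mental model, the three settings can be expressed as a small decision function. This is a sketch under assumptions, not the product's logic: `Tiebreaker` and `pick_identity` are hypothetical names, and falling back to the other signal when the preferred one yields no match is an assumption the text above does not state.

```python
from enum import Enum
from typing import Optional


class Tiebreaker(Enum):
    URL = "url"              # default
    NAME = "name"
    NAME_ONLY = "name_only"


def pick_identity(
    setting: Tiebreaker,
    name_match: Optional[str],
    url_match: Optional[str],
) -> Optional[str]:
    """Choose between the entity suggested by the name and the one suggested by the URL."""
    if setting is Tiebreaker.NAME_ONLY:
        return name_match  # other signals are ignored entirely
    if setting is Tiebreaker.NAME:
        return name_match or url_match  # fallback is an assumption
    return url_match or name_match      # fallback is an assumption
```

For a record named "Pepsi" with the website `fritolay.com`, `pick_identity(Tiebreaker.URL, "PepsiCo", "Frito-Lay")` returns `"Frito-Lay"`, whereas the `NAME` setting returns `"PepsiCo"`.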

***

#### Entity resolution output

The output of the analysis consists of a KERN ID, all of its component fields, a human-readable explanation of the decision, and a confidence level of High, Medium, or Low.

"Low" confidence means there was latent ambiguity, i.e., not enough information to establish a single correct identity beyond doubt. Resolving "Oracle" to the Oracle Corporation is an example of this: most humans would agree that the match is sensible, but strictly speaking there are other companies in the world named Oracle.

The confidence level allows the user to trade off coverage against certainty.
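
The coverage versus certainty trade-off can be made concrete with a small sketch. The `Confidence` enum and both helper functions are hypothetical, and the confidence levels assigned to the sample accounts are made up for illustration (apart from "Oracle" being Low, as described above).

```python
from enum import IntEnum


class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def accepted(matches: dict[str, Confidence], minimum: Confidence) -> list[str]:
    """Return the accounts whose match meets the minimum confidence level."""
    return [account for account, level in matches.items() if level >= minimum]


def coverage(matches: dict[str, Confidence], minimum: Confidence) -> float:
    """Share of accounts that would be accepted at the given threshold."""
    if not matches:
        return 0.0
    return len(accepted(matches, minimum)) / len(matches)


# Made-up sample: account name -> confidence of its resolved identity.
matches = {
    "Pepsi": Confidence.HIGH,
    "Dove": Confidence.MEDIUM,
    "Oracle": Confidence.LOW,
    "Kernel": Confidence.LOW,
}
```

With this sample, accepting only High matches covers 25% of accounts, accepting Medium and above covers 50%, and accepting everything covers 100% at the cost of including the ambiguous "Oracle" and "Kernel" matches.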

***

#### Identity viewer

The identity viewer in the Kernel app gives you direct visibility into the resolution process. For each account, you can inspect what Kernel has resolved the identity to and the reasoning behind it — helping you verify the match before taking any action.

