Semantic chunking involves splitting text based on semantic similarity. In our case, we’ll explore how to achieve this using Word2Vec embeddings right in your web browser. Buckle up, and let’s get started!

What Is a Word2Vec Embedding?

Word2Vec is a popular word embedding technique that represents words as dense vectors in a high-dimensional space. These vectors capture semantic relationships between words, making them useful for various natural language processing (NLP) tasks.
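In practice, a pretrained Word2Vec model behaves like a lookup table from words to fixed-length numeric arrays. Here is a minimal sketch in JavaScript; the embeddings object, its values, and the getWordVector helper are made up for illustration (real models typically use 100–300 dimensions):

// Hypothetical lookup table loaded from a pretrained Word2Vec model.
// Only 4 dimensions are shown here to keep the example readable.
var embeddings = {
    "king":  [0.50, 0.68, -0.59, 0.12],
    "queen": [0.54, 0.71, -0.55, 0.33],
    "apple": [-0.12, 0.03, 0.88, -0.41]
};

// Look up a word's vector; words missing from the table return undefined.
function getWordVector(word) {
    return embeddings[word.toLowerCase()];
}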

The Secret Ingredient: Lots of Text Data

The secret to getting Word2Vec really working for you is to have lots and lots of text data in the relevant domain. If you’re building a sentiment lexicon, for example, a dataset from the medical domain or even Wikipedia might not be effective. Choose your data wisely.

Now That We Have Our Vectors

Let’s use them for some similarity determinations in the browser. Here’s the code:

// Dot product of two equal-length vectors.
function vecDotProduct(vecA, vecB) {
    var product = 0;
    for (var i = 0; i < vecA.length; i++) {
        product += vecA[i] * vecB[i];
    }
    return product;
}

// Euclidean length (magnitude) of a vector.
function vecMagnitude(vec) {
    var sum = 0;
    for (var i = 0; i < vec.length; i++) {
        sum += vec[i] * vec[i];
    }
    return Math.sqrt(sum);
}

// Cosine similarity: dot product normalized by both magnitudes.
function cosineSimilarity(vecA, vecB) {
    return vecDotProduct(vecA, vecB) / (vecMagnitude(vecA) * vecMagnitude(vecB));
}

The result will be a value between -1 (completely opposite) and 1 (identical). Once we can turn a string into a vector, we can determine how similar any arbitrary string is to any other string. This sort of similarity measure can be used in a retrieval-augmented generation (RAG) solution to decide which context to include in a prompt.
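One simple way to get a vector for a whole string is to average the vectors of its words, and then semantic chunking falls out naturally: start a new chunk whenever the similarity between adjacent sentences drops. The sketch below assumes the hypothetical getWordVector helper from earlier, and the 0.7 threshold in the usage example is just an illustrative value you would tune for your own data:

// Average the word vectors of a sentence to get a single sentence vector.
// Words missing from the embedding table are skipped.
function sentenceVector(sentence) {
    var words = sentence.toLowerCase().split(/\s+/);
    var sum = null;
    var count = 0;
    for (var i = 0; i < words.length; i++) {
        var vec = getWordVector(words[i]);
        if (!vec) continue;
        if (sum === null) sum = new Array(vec.length).fill(0);
        for (var j = 0; j < vec.length; j++) sum[j] += vec[j];
        count++;
    }
    if (!sum) return null;
    for (var k = 0; k < sum.length; k++) sum[k] /= count;
    return sum;
}

// Group consecutive sentences into chunks, starting a new chunk whenever
// similarity to the previous sentence falls below the threshold.
function semanticChunks(sentences, threshold) {
    var chunks = [];
    var current = [];
    var prevVec = null;
    for (var i = 0; i < sentences.length; i++) {
        var vec = sentenceVector(sentences[i]);
        if (prevVec && vec && cosineSimilarity(prevVec, vec) < threshold) {
            chunks.push(current.join(" "));
            current = [];
        }
        current.push(sentences[i]);
        if (vec) prevVec = vec;
    }
    if (current.length) chunks.push(current.join(" "));
    return chunks;
}

// Example usage: split wherever adjacent sentences fall below 0.7 similarity.
// var chunks = semanticChunks(["The king ruled wisely.", "Apples grow on trees."], 0.7);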
