Introducing Kinase, B12’s Web Content Labeling Framework

The web contains a variety of content that is easily accessible to humans but ill-formed for machines. Many organizations have spent countless hours building user interfaces and algorithms for web scraping, schema inference, and structured data extraction from the far corners of the web.

7 July 2017

by Joseph Botros and Adam Marcus

Today, we’re excited to announce the open source release of Kinase, a framework for creating Chrome extensions to quickly label content on the web for data collection and machine learning applications. Through a simple API, you can tell Kinase the type of content you’d like to label (e.g., product listings, company logos), and Kinase will generate a Chrome extension that allows your team to label this kind of data on the web.

With Kinase, we hope that the many companies, academic institutions, and journalists who want to add structure to unstructured data on the web can focus on their applications without having to custom-design a user interface for acquiring that data.

A Brief History of Structuring Data on the Web

Structured data extraction from the web has been a commercial, academic, and journalistic fascination for a long time. Here is just a small set of efforts in this vein:

Various companies including import.io and the now-defunct Kimono Labs have been established with the goal of turning structured data embedded on websites into machine-accessible APIs. We hope to make user interfaces like these easier to build with Kinase.
Open source projects such as Scrapy and Portia help programmatically or visually create web crawlers to scrape large collections of structured data from websites.
Vertical-specific efforts like that at Locu use human- and machine-powered pipelines to extract valuable information such as price lists.
Google’s WebTables project is probably the largest effort to turn every table Google’s crawlers find on the web into schema-aligned structured content. The research behind this project turned into Google Fusion Tables, with the project eventually making its way into Google Docs.
In journalism, data scraping is often a step along the way to identifying an insight for a story. There are many tutorials and explainers about what this process looks like.

Kinase fits into this space by providing a reusable user interface for labeling content on a website. It does not offer complex extraction algorithms or methods of extracting large paginated collections of data from websites. Instead, it makes it simple for users to click on text and other media (e.g., a product’s title and photo), specify which fields in your schema the clicked-on content is relevant to (e.g., title and picture), and save this labeled content via whatever API you provide.

Key Concepts

There are three Kinase concepts that it helps to understand before using Kinase or creating your own Kinase extension: annotations are the fields you would like to label, mappings are the specific values of those fields on a particular website, and contexts are groupings of mappings for a particular labeling session. We expand on each concept below.

Annotations

Annotations represent the labels for which you’d like to select content. For instance, you might want to map all the products from a company’s website to a products annotation.

Content selected for an annotation must be mapped to one of its fields. In the case of our products annotation, these fields might include a name, description, and photo. Each field in an annotation can only be mapped to a specific type of content: in our example, text, rich-text, and image, respectively.

Our products annotation can be mapped to a variable number of products containing that set of fields. Kinase also allows you to enforce a single mapping for an annotation, as described in the next section.
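To make the idea concrete, here is a minimal sketch of how the products annotation could be written down as data. The object shape, the `multiple` flag, and the `validateAnnotation` helper are illustrative assumptions rather than Kinase's documented configuration format; only the three content types (text, rich-text, image) come from the description above.

```javascript
// Hypothetical annotation definition. The property names and object shape
// are assumptions for illustration, not Kinase's documented format.
const FIELD_TYPES = ['text', 'rich-text', 'image']

const productsAnnotation = {
  name: 'products',
  multiple: true, // assumed flag: allow a variable number of mappings
  fields: [
    { name: 'name', type: 'text' },
    { name: 'description', type: 'rich-text' },
    { name: 'photo', type: 'image' }
  ]
}

// Return a list of problems with an annotation definition (empty if valid).
function validateAnnotation(annotation) {
  const problems = []
  if (!annotation.name) problems.push('annotation needs a name')
  for (const field of annotation.fields || []) {
    if (!FIELD_TYPES.includes(field.type)) {
      problems.push(`field "${field.name}" has unknown type "${field.type}"`)
    }
  }
  return problems
}
```

Checking the definition up front like this catches a field that is declared with a content type the labeling interface could never select.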

Mappings

The content a user selects for an annotation is called a mapping. A single mapping to products would include the actual text taken from the web for a product’s name and description, as well as an image for its photo.

Each mapped field in a mapping contains the content taken from the website (with any additional user edits) and the original source of that content (the URL of the website it was taken from, and a unique CSS selector specifying its container).

If specified, an annotation can support multiple mappings. For products, for example, the set of mappings for that annotation might represent a store’s entire product catalog.
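As a rough sketch of what a single mapping might hold, the snippet below pairs each mapped field's content with its source URL and CSS selector. The helper names (`mapField`, `editField`), the property names, and the example URL and selectors are all hypothetical. Kinase's real stored format may differ.

```javascript
// Hypothetical shapes for illustration; Kinase's real mapping format may differ.

// Build one mapped field: the selected content plus where it came from.
function mapField(content, url, selector) {
  return { content, source: { url, selector } }
}

// Apply a user edit: the content changes, the recorded source does not.
function editField(field, newContent) {
  return { ...field, content: newContent }
}

const pageUrl = 'https://example.com/products' // placeholder URL
const productMapping = {
  name: mapField('example laptop 13"', pageUrl, '#product-1 > h2'),
  description: mapField('<p>A thin, light laptop.</p>', pageUrl, '#product-1 > .description'),
  photo: mapField('https://example.com/img/laptop.png', pageUrl, '#product-1 img')
}

// The user tidies up the scraped name before saving.
const editedName = editField(productMapping.name, 'Example Laptop 13"')
```

Keeping the source alongside the content means an edited value can still be traced back to the element it was selected from.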

Contexts

In Kinase, annotations and their mappings are grouped together in a context. This context is keyed by an arbitrary string and might represent a specific user labeling content in the extension or a project that content is being labeled for. If a user were doing research on both the laptop and desktop market, they might use a separate context to represent each market with each context containing the products annotation.

By default, all mapped content is stored in a default context, so if you don’t need to switch contexts you can ignore the concept entirely.
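The grouping described above can be sketched as a small store keyed first by context and then by annotation name. The `ContextStore` class and its method names are assumptions made for this example and are not part of Kinase's API. The sketch only illustrates how a default context lets callers ignore the concept.

```javascript
// Hypothetical in-memory store; class and method names are illustrative only.
const DEFAULT_CONTEXT = 'default'

class ContextStore {
  constructor() {
    // Map of context key -> Map of annotation name -> array of mappings
    this.contexts = new Map()
  }

  // Save a mapping under an annotation, in the given (or default) context.
  addMapping(annotationName, mapping, context = DEFAULT_CONTEXT) {
    if (!this.contexts.has(context)) this.contexts.set(context, new Map())
    const annotations = this.contexts.get(context)
    if (!annotations.has(annotationName)) annotations.set(annotationName, [])
    annotations.get(annotationName).push(mapping)
  }

  // Read back the mappings for an annotation; unknown keys yield an empty list.
  getMappings(annotationName, context = DEFAULT_CONTEXT) {
    const annotations = this.contexts.get(context)
    if (!annotations) return []
    return annotations.get(annotationName) || []
  }
}

// One context per market, each containing the same products annotation.
const contextStore = new ContextStore()
contextStore.addMapping('products', { name: 'Laptop A' }, 'laptops')
contextStore.addMapping('products', { name: 'Laptop B' }, 'laptops')
contextStore.addMapping('products', { name: 'Desktop A' }, 'desktops')
// No context given, so this lands in the default context.
contextStore.addMapping('products', { name: 'Uncategorized item' })
```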

Using Kinase

Using Kinase involves creating your own instance of Kinase, programmatically telling Kinase what the data you’re labeling looks like, and then generating a Chrome extension that your organization can use to label data with that schema.

If you like working through examples, check out our example Kinase-based extension to see how things fit together, and pay special attention to the example extension creation script.

Here’s a brief walkthrough of how the process works:

First, install Kinase with npm install kinase (or better, yarn add kinase).
Then, you can create your own derived extension:

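const Kinase = require('kinase')
// options.output (required) is the path where the extension zipfile is written
const options = { output: './my-extension.zip' } // example path
const extension = new Kinase(options)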

Running this code creates a zipfile of your extension at the path given by the required options.output parameter. Other configuration options let you customize your instance of Kinase, for example by supplying a custom API for reading and saving the labeled data in structured form.

How We Use Kinase

Within B12, we use Kinase in various ways to power our Smart Websites:

Porting over old customer websites. When customers who have an existing website request a Launch Boost, our experts use Kinase to label all of the customers’ old structured data (e.g., their team, products, services) so that we can reuse this information on the website we are designing.
Training machine learning models. When a customer first joins B12, our robots algorithmically design a website for the customer in 60 seconds with as much of the content from their old website and social media presence as possible displayed correctly. To do this, we use a number of models including logo classifiers, about text classifiers, and collection (e.g., teams, products) classifiers that require training data. We’re currently exploring using the data that is generated from Kinase labeling sessions to train these models so that they become more accurate in the future.

These use cases highlight why building a Chrome plugin for these purposes is helpful. Our experts can enable Kinase on a customer’s old website or Facebook page, label the structured data they would like to ingest, and save this data via our internal API. We provide the schema of the data we’d like our experts to label and an API for saving the labeled data, and Kinase provides a reusable labeling interface that interacts nicely with any webpage.

Under the Hood

Kinase is written in React and transpiled from ES6, while styles for each React component are compartmentalized using CSS Modules. Our Webpack build does all the heavy lifting.

Application state is managed using Redux. Because the Kinase interface lives inside a content script that belongs to each tab, we persist the central Redux store on the extension’s background page, using react-chrome-redux to facilitate that communication.
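The single-store arrangement can be illustrated without any dependencies. In the sketch below, `createStore` is a tiny hand-rolled stand-in for Redux's function of the same name, and `createTabProxy` simulates the message passing that react-chrome-redux handles in a real extension. The action type, the state shape, and the proxy helper are assumptions for illustration and are not taken from the Kinase source.

```javascript
// Reducer: a pure function from (state, action) to the next state.
// The 'ADD_MAPPING' action and state shape are hypothetical.
function mappingsReducer(state = { mappings: [] }, action) {
  switch (action.type) {
    case 'ADD_MAPPING':
      return { ...state, mappings: [...state.mappings, action.mapping] }
    default:
      return state
  }
}

// Minimal stand-in for Redux's createStore, so the sketch runs on its own.
function createStore(reducer) {
  let state = reducer(undefined, { type: '@@INIT' })
  const listeners = []
  return {
    getState: () => state,
    dispatch(action) {
      state = reducer(state, action)
      listeners.forEach(listener => listener())
      return action
    },
    subscribe(listener) {
      listeners.push(listener)
    }
  }
}

// The "background page" owns the one real store.
const backgroundStore = createStore(mappingsReducer)

// Each "tab" holds only a proxy that forwards actions to the background
// store and reads state from it (simulating cross-context messaging).
function createTabProxy(store, tabId) {
  return {
    dispatch: action => store.dispatch({ ...action, tabId }),
    getState: () => store.getState()
  }
}

const tabA = createTabProxy(backgroundStore, 1)
const tabB = createTabProxy(backgroundStore, 2)
tabA.dispatch({ type: 'ADD_MAPPING', mapping: { name: 'Example Laptop' } })
```

Because both proxies read from the same store, a mapping added in one tab is visible from the other, which is the property that keeping the store on the background page is meant to provide.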

One interesting gotcha we ran into while implementing Kinase was sandboxing the Kinase interface from interfering website styles. The Chromium sidebar API was deprecated last year, so we needed to inject our React root into the page itself. We then explored the Shadow DOM, a spookily-named new technology for creating web components. Our new, shadowy Kinase was perfectly isolated from other site styles but wasn’t actually functional: React events aren’t properly propagated inside the Shadow DOM. We currently apply the all: initial CSS declaration to specific interface elements to combat commonly occurring style issues, but we’re still working on a complete fix.

Conclusion

We hope that you’ll find Kinase as exciting as we do in helping to power your own structured data applications! Check it out on GitHub to learn more and get started with building your own extension. And if you find our work interesting, come work with us!

Thank you to Ted Benson for giving feedback on earlier drafts of this post.
