Deep Learning

Sub-word deep neural networks are used to evaluate documents. This gives a boost in accuracy over classic approaches like TF-IDF and Word2Vec, which break down on spelling mistakes.
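A rough illustration of the sub-word idea (generic character n-gram code, not DataCat's actual tokenizer): a misspelled word still shares most of its sub-word features with the correct spelling, while an exact whole-word match would get zero overlap.

```python
# Generic character n-gram sketch (not DataCat's actual tokenizer): typos keep
# most of their sub-word features, while whole-word matching gets no overlap.
def char_ngrams(word, n=3):
    padded = f"<{word}>"                          # mark word boundaries
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

correct, typo = char_ngrams("received"), char_ngrams("recieved")
print(correct & typo)                             # shared n-grams: '<re', 'rec', 'ved', 'ed>'
print(len(correct & typo) / len(correct | typo))  # Jaccard overlap ~0.33 instead of 0.0
```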


Transfer Learning

We rely on large models, pre-trained on huge amounts of text. Reusing the learned language model drastically reduces training time.
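The general recipe looks roughly like the generic Keras sketch below (illustrative only; the pre-trained file name, layer sizes, and task head are assumptions, not DataCat's actual pipeline):

```python
# Generic transfer-learning sketch (illustrative, not DataCat's actual pipeline).
import tensorflow as tf

num_classes = 5                                         # e.g. five document categories
base = tf.keras.models.load_model("pretrained_lm.h5")   # hypothetical pre-trained language model
base.trainable = False                                  # reuse what it learned, don't retrain it

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # small task-specific head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Only the head trains, so far fewer epochs and far less time are needed:
# model.fit(encoded_texts, labels, epochs=3)
```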


Auto ML

Our algorithm cleans and analyzes the data to decide how to optimize the learning process.


API & Downloads

You can make predictions not only from the API but from the UI as well, and download the results in the most popular formats: CSV, Excel, and PDF.
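A batch prediction over the API might look like the sketch below. None of these endpoint paths, field names, or the token format are confirmed; they only illustrate the idea, and the real API may differ.

```python
# Hypothetical sketch of batch prediction over an HTTP API (endpoints and fields
# are placeholders for illustration, not the documented DataCat API).
import requests

API = "https://datacat.example/api"          # placeholder base URL
headers = {"Authorization": "Bearer <your-token>"}

with open("to_predict.csv", "rb") as f:
    job = requests.post(f"{API}/models/<model-uid>/predict",
                        headers=headers, files={"file": f}).json()

# ...once the job is done, download the results in the format you prefer.
result = requests.get(f"{API}/jobs/{job['uid']}/result?format=csv", headers=headers)
open("predictions.csv", "wb").write(result.content)
```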

Has all the training modules you need
All are backed by Auto Transfer Learning

*We are rolling out the learning modules one by one. Email us if you need a module that is not yet available during the private beta.




Classify

Classify documents into categories (like topics, genres, or the name of a person or service the document should be routed to).

Sentiment

A special module that looks for emotional sentiment (like how happy, angry, excited, or sad a text is).

Score

Use regression to turn documents into numeric scores (like value, appropriateness, urgency, or priority).

Label

Label parts of text with tags (like entity detection, text summarization, or keyword extraction).

Compare

Compare two documents to make a decision (like similarity or relationship detection).

Generate Text

An experimental module that tries to generate lookalike text (like song lyrics, fake comments, etc.).

How it works

Interactive overview

Create a model and import a CSV

Create a new Model. Choose the problem type (like Classification) and upload a correctly formatted CSV file for training (with the desired labels).
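For example, a training CSV for the Classify module could be as simple as one text column plus one label column. The column names below are an assumption for illustration; the labels match the job-title example further down.

```
input,label
Call Centre Officer,Customer Service
Iteration Manager/Scrum Master,IT
CNC Machinist,Factory Engineering
```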

Wait

Our service will start working on the file. You can monitor the progress on the dashboard. Once it's done, you'll be ready to make some predictions.


It will look something like this:
Classify: (248) 840-7081
Created 11.11.2018
Training. Completed: 72%
View | Edit

Score: Client churn risk
Created 11.21.2017
Training failed. Retrying...
View | Edit

Load CSV, Predict, and download the result

Upload a CSV with the rows you want predictions for, run the prediction, and download the results in your preferred format.


It will look something like this:

Order | Input                          | Result              | Confidence
1     | Call Centre Officer            | Customer Service    | 0.94
2     | Iteration Manager/Scrum Master | IT                  | 0.99
3     | CNC Machinist                  | Factory Engineering | 0.94
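Once downloaded, the result file is just a regular CSV you can post-process however you like, e.g. keep only the confident predictions (column names below are assumed to match the table above):

```python
# Read the downloaded predictions and keep only confident rows (columns assumed
# to match the table above: Order, Input, Result, Confidence).
import csv

with open("predictions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

confident = [r for r in rows if float(r["Confidence"]) >= 0.9]
print(f"{len(confident)} of {len(rows)} predictions have confidence >= 0.9")
```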

Reuse the model for prediction

You can train once and predict as many times as you want, reusing the model.
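In terms of the hypothetical API sketch above, that just means reusing the same model UID for every new batch (endpoint names are illustrative, not the documented API):

```python
# Reusing one trained model for many prediction batches (hypothetical endpoints,
# same placeholder assumptions as the API sketch above).
import requests

API = "https://datacat.example/api"
headers = {"Authorization": "Bearer <your-token>"}
MODEL_UID = "<model-uid>"                     # train once, keep the UID

for batch in ["january.csv", "february.csv", "march.csv"]:
    with open(batch, "rb") as f:
        requests.post(f"{API}/models/{MODEL_UID}/predict",
                      headers=headers, files={"file": f})
```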

Frequently asked questions

Info about us and the service

How accurate are you?

It depends. In DarkCat we have a fixed set of models that process data all the time, so we've invested in training them: they are up to 20 times larger, slower at inference, do real hyperparameter tuning, and can train for days. They are better than anything we tried off the shelf.

When we repurposed DarkCat for DataCat, we doubled down on the idea that trading a bit of accuracy for a 100x speed-up of the training process is a good deal. It also lets us store many more models in the same amount of storage. But don't get this wrong: DataCat is still much more accurate than most of the benchmarks we tried. If we get our hands on a large amount of GPU resources, we will start offering longer training runs for bigger models.

If you are a data scientist with a dedicated machine to train your models for a day or two, you should expect slightly higher accuracy than ours. If yours is lower, you are probably doing something wrong :) Also, remember that NLP progress never stops. If you are reading this in the not-so-distant future, it's possible that a much better approach has been found and we haven't had time to implement it in the service.

Why was DataCat created?

DataCat is a rebranded and simplified version of DarkCat, a text-analysis module written for the DarkSentinel crawler. We had to process lots of unstructured text data littered with HTML markup and spelling mistakes, so we coupled sub-word deep neural networks with a distributed batch-processing queue that can run workers on consumer-grade hardware.

Why is it “text only”?

We are in the text processing “business” (or hobby, it depends). If you have structured relational data and don't have a data scientist, try auto-sklearn, H2O AutoML, or CatBoost. It's a solved problem.
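For reference, a minimal auto-sklearn sketch looks roughly like this (assuming the library is installed; the dataset is a scikit-learn toy set and the time budgets are arbitrary):

```python
# Minimal auto-sklearn sketch for structured (tabular) data.
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,   # let the model/hyperparameter search run for 5 minutes
    per_run_time_limit=30)         # cap each candidate model at 30 seconds
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
```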

Why do you use batch instead of online predictions?

We can do online predictions, but we have no need for them in DarkCat. We mainly get data in batches and process it on consumer-grade computers that are not online 100% of the time. That shaped the philosophy of the original service.

Why are you targeting consumer grade for training?

The cloud GPU market is insanely overpriced right now (2018) due to the hype, the GPU market monopoly, and legal issues for data centers. Consumer GPUs in a PC pay for themselves in 3-4 weeks compared to cloud offerings, and over a standard 3-year use cycle they are 30+ times more cost efficient. DataCat and DarkSentinel are not commercial services, so we have to use the resources of our spare machines.
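As a back-of-envelope illustration of that claim (both prices below are assumptions about typical 2018 rates, not figures from the DataCat team):

```python
# Back-of-envelope check of the payoff claim; both prices are assumptions.
cloud_per_hour = 3.0       # assumed on-demand cloud GPU rate, USD/hour
consumer_gpu = 1500.0      # assumed consumer GPU plus a share of the PC, USD

print(consumer_gpu / cloud_per_hour / 24, "days to break even at 24/7 load")   # ~21 days, i.e. ~3 weeks
print(cloud_per_hour * 24 * 365 * 3 / consumer_gpu, "x cheaper over 3 years")  # ~50x before power costs
```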

Why is my job not training and is still in the queue?

Though we have several machines in the network munching data at any given time, most of them are configured in a selfish fashion and handle training and predictions for their own models (DarkSentinel’s DarkCat workers). They take public jobs only when they have spare resources. Right now only a small amount of resources is dedicated 100% to public jobs (DataCat workers), so there is a real queue.

On one particular data set I get an abysmal accuracy (close to random). Why?

Our AutoML has probably done something stupid. If you are sure it's not a data set problem and are OK with us looking at your data, you can email us the model UID and the UIDs of the failed predictions. We will try to find out what happened.

Can I buy you or invest in you?

You can't. We are not a company.

Can I pay for DataCat to have more resource available?

No. We want to stay out of the for-profit business. But if there is interest, we will probably release a Docker worker container that you can run on your own GPU machine, using its resources for your tasks.

Is my data safe in a distributed worker environment?

Right now all data processing takes place on machines owned by us (DataCat and DarkCat). Workers do not create local copies; the data exists only on encrypted storage. If we distribute a public worker, you will have the option to choose what your worker processes, and whether your model can be processed publicly, on our machines, or only on yours. We personally think that most datasets should be public, as it helps others develop better things. Not all, but most.

Can I buy your technology?

Probably not. We are mostly built on open-source software like TensorFlow, Django, Docker, and others. You can use them for free right now, and you don't need a large team to make something good out of them. If we make something good, we'll probably open-source it as well, so you will have it too. There is no secret technology, just public technology and the knowledge of how to use it.

Can I hire you?

We are not that interested. All of us have cool jobs already and are spammed by Facebook, Google, Amazon, and the others. If you are really, really interested, send us an email; your offer should be something genuinely interesting. If we were in this for the money, we would have made a paid service.

I've found mistakes on the site. How can I notify you?

Unfortunately, none of us speaks English natively. You can email mistakes and bugs to us and we will fix them.

© 2017-2019 DataCat overpensiveness
This is a non-profit service. Public accounts are limited, but full-featured ones are granted, not sold.
This service supports English only. We are a very small team, so we can't do all the languages.