Sub-word deep neural networks are used to evaluate documents. This gives a boost in accuracy over classic approaches like TF-IDF and Word2Vec, which fail on spelling mistakes.
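As a rough illustration of why sub-word units cope with typos, here is a minimal sketch that compares a correctly spelled word with a misspelled one. It is not the model the service runs: the character-trigram split, the `<`/`>` boundary markers, and the helper names `char_ngrams` and `jaccard` are assumptions chosen for the example.

```python
def char_ngrams(word: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two sets (0.0 = disjoint, 1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


correct, typo = "receive", "recieve"

# Word-level view: the two tokens are unrelated vocabulary entries.
word_level_match = correct == typo

# Sub-word view: some character trigrams are still shared despite the typo,
# so the misspelled word keeps part of the signal of the correct one.
subword_overlap = jaccard(char_ngrams(correct), char_ngrams(typo))

print("word-level match:", word_level_match)
print("sub-word overlap:", round(subword_overlap, 2))
```

A word-level vocabulary treats the typo as an unknown token, while the sub-word view still finds a non-zero overlap to work with.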
We rely on large models, pre-trained on huge amounts of text. Reusing the learned language model drastically reduces the training time.
Our algorithm cleans and analyzes the data to decide how to optimize the learning process.
You can make predictions not only through the API but through the UI as well, and download the results in the most popular formats: CSV, Excel, and PDF.
*We are rolling out learning modules one by one. Email us if you need a module that is not yet available during the private beta.
Classify documents into categories (like topics, genres, or the name of the person or service a document should be routed to).
A special module that looks for emotional sentiment (like how happy, angry, excited, or sad a text is).
Use regression to turn documents into scoring values (like value, appropriateness, urgency, or priority).
Label parts of text with tags (like entity detection, text summarization, or keyword extraction).
Compare two documents to make a decision (like similarity or relationship detection).
An experimental module that can try to generate lookalike text (like song lyrics or fake comments).
Create a new Model. Choose the problem type (like Classification) and load a correctly formatted CSV file for training (with the desired labels).
Our service will start working on the file. You can monitor the progress on the dashboard. Once it's done, you'll be ready to make some predictions.
Load a CSV, predict, and download the result.
| 1 | Call Centre Officer | Customer Service | 0.94 |
| 2 | Iteration Manager/Scrum Master | IT | 0.99 |
| 3 | CNC Machinist | Factory Engineering | 0.94 |
You can train once and predict as many times as you want, reusing the model.
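For the training file in the first step, a minimal sketch of building a labeled CSV could look like this. The exact layout the service expects is not described here, so the column names `text` and `label`, the helpers `to_csv` and `read_csv`, and the reuse of the job titles from the example above as training rows are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical column names; the real required CSV layout is not specified here.
FIELDNAMES = ["text", "label"]

rows = [
    {"text": "Call Centre Officer", "label": "Customer Service"},
    {"text": "Iteration Manager/Scrum Master", "label": "IT"},
    {"text": "CNC Machinist", "label": "Factory Engineering"},
]


def to_csv(records: list[dict[str, str]]) -> str:
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def read_csv(data: str) -> list[dict[str, str]]:
    """Parse CSV text back into a list of records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(data))]


csv_text = to_csv(rows)
print(csv_text)

# Round-trip check: what was written is what gets read back.
assert read_csv(csv_text) == rows
```

Using the `csv` module instead of joining strings by hand keeps documents that contain commas or quotes correctly escaped.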
Info about us and the service
It depends. In DarkCat we have a fixed number of models that process data all the time, so we've invested in training them. They are up to 20 times larger, slower at inference, go through real hyperparameter tuning, and can train for days. They are better than anything we tried off the shelf. When we repurposed DarkCat for DataCat, we doubled, tripled, and quadrupled down on the idea that trading a bit of accuracy for a 100x speedup of the training process is a good deal. This also allowed us to store many more models in the same amount of storage. But don't get this wrong: DataCat is still much more accurate than most of the benchmarks we tried. If we can get our hands on a huge amount of GPU resources, we will start offering longer training of bigger models. If you are a data scientist with a dedicated machine to train your models for a day or two, you should expect slightly higher accuracy. If yours is lower, you are probably doing something wrong :) Also, remember that NLP progress never stops. If you are reading this in the not-so-distant future, it's possible that a much better approach has been found and we have had no time to implement it in the service.
DataCat is a rebranded and simplified version of DarkCat, a text analysis module written for the DarkSentinel crawler. We had to process lots of unstructured text data, littered with HTML markup and spelling mistakes, so we coupled sub-word deep neural networks with a distributed batch processing queue that can run workers on consumer-grade hardware.
We are in the text processing "business" (or hobby, it depends). If you have structured relational data and don't have a data scientist, try auto-sklearn, H2O AutoML, or CatBoost. It's a solved problem.
We can do online ones, but we have no need for them in DarkCat. We mainly get data in batches and process it on consumer-grade computers that are not online 100% of the time. That influenced the philosophy of the initial service.
The cloud GPU market is insanely overpriced right now (2018), due to the hype, the GPU market monopoly, and legal issues for data centers. Consumer GPUs in a PC pay for themselves in 3-4 weeks compared to cloud offerings. Over a standard 3-year use cycle they are 30+ times more cost-efficient. DataCat and DarkSentinel are not commercial services, so we have to use the resources of our spare machines.
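The two figures above can be checked against each other with simple arithmetic. The sketch below assumes that renting equivalent cloud capacity costs one card's purchase price every `payback_weeks`, and that the card runs for the full three years. It ignores electricity, hardware failures, and idle time, all of which lower the ratio. The function name `cost_ratio` is made up for this example.

```python
WEEKS_PER_YEAR = 52


def cost_ratio(payback_weeks: float, lifetime_years: float = 3.0) -> float:
    """Cloud spend over the lifetime, expressed in multiples of the card's price.

    Assumes cloud rental costs one purchase price every `payback_weeks`
    and ignores running costs such as electricity.
    """
    return lifetime_years * WEEKS_PER_YEAR / payback_weeks


for weeks in (3, 4):
    print(f"payback in {weeks} weeks -> {cost_ratio(weeks):.0f}x over 3 years")
```

A payback of 3 to 4 weeks gives an upper bound of roughly 39x to 52x over three years, so "30+ times" is plausible once running costs are subtracted.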
Though we have several machines in the network munching data at any given time, most of them are configured in an egotistical fashion and do the training and predictions for their own models (DarkSentinel's DarkCat workers). They take public jobs only when they have spare resources. Right now we've dedicated only a small amount of resources to serving public jobs 100% of the time (DataCat workers), so we have a real queue.
Our AutoML has probably done something stupid. If you are sure it's not a data set problem and are OK with us looking at your data, you can email us the model UID and the UIDs of the failed predictions. We will try to find out what happened.
You can't. We are not a company.
No. We want to stay out of for-profit business. But if there is interest, we will probably release a Docker worker container that you can put on your GPU machine to use its resources for your tasks.
Right now all the data processing takes place on machines owned by us (DataCat and DarkCat). Workers do not create local copies of the data; it exists only on encrypted storage. If we distribute a public worker, you will have the option to choose what the worker processes, and whether your model can be processed publicly, on our machines, or only on yours. We, personally, think that most datasets should be public, as it helps others to develop better things. Not all, but most.
Probably not. We are mostly built on open source software like TensorFlow, Django, Docker, and others. You can use them for free right now, and you don't need a large team to make something good out of them. If we make something good, we'll probably open source it as well, so you will have it too. There is no secret technology, just public technology and the knowledge of how to use it.
We are not that interested. All of us have cool jobs and are already spammed by Facebook, Google, Amazon, and others. If you are really, really interested, send us an email. Your offer should be something interesting. If we were in this for the money, we would've made a paid service.
Unfortunately, none of us speaks English natively. You can email mistakes and bugs to us and we will fix them.
© 2017-2019 DataCat
This is a non-profit service. Public accounts are limited, but full-featured ones are granted, not sold.
This service supports English only. We are a very small team, so we can't do all the languages.