Show HN: A labelling tool to easily extract and label Wikipedia data
116 points by mariarmestre on Dec 17, 2021 | hide | past | favorite | 13 comments
Hi HN! I am Maria, solo founder of DataQA (https://dataqa.ai/), a tool to search and label documents for various NLP tasks (e.g. entity extraction, entity linking, etc).

I have worked as a data scientist and ML engineer for the better part of a decade, and over that time have specialised mainly in applications involving natural language processing (NLP). One of the key questions I have always had at the back of my mind is whether my time was well spent: whenever I spent more time on feature engineering or trying different models, I always wondered whether I would get a better return on investment by simply labelling more data.

I created DataQA to enhance exploration and labelling of documents. It is open-source and ships with the Elasticsearch text search engine, which I have packaged as a Python package (this might be the topic of a future technical post), as well as a rules-based engine that pre-labels documents using NLP rules. It is very easy to install with a single pip command.

One of the key things I wanted to add to DataQA is an integration with Wikipedia. Even though Wikipedia is the largest living repository of human knowledge in the world, I have always found it difficult to process it and create structured datasets for my specific applications. Since wiki pages are long-form articles, it is important to divide the text into smaller chunks. A lot of the interesting data is also sometimes displayed in tables. With DataQA you can now upload a list of Wikipedia page URLs and the tool will extract the articles, process them and even parse the tables, so you can then label any entities you want. You can find a tutorial here: https://towardsdatascience.com/a-labelling-tool-to-easily-ex....
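To make the "smaller chunks" step concrete, here is a minimal sketch of paragraph-based chunking with a character budget. The function name and parameters are my own illustration, not DataQA's actual API:

```python
# Hypothetical sketch: split a long-form article into smaller chunks
# by paragraph, merging consecutive paragraphs up to a character budget.
def chunk_article(text: str, max_chars: int = 500) -> list[str]:
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than the budget still becomes its own chunk; a real implementation would probably split further on sentence boundaries.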

The open-source version of DataQA currently only supports CSV, but I have an enterprise version with premium features such as labelling of PDFs (with table understanding). If you're interested in a free trial, please contact me at contact@dataqa.ai :-).



This is a great project and tutorial. IMHO you have a value prop that's much larger/better than what you're describing here.

To me it sounds like you're creating a data mining annotation tool that can work on any large corpus of free-form documents that have discoverable labels, such as medical records, legal cases, press releases, SEC filings, customer reviews, etc.

Can you speak to any of these? And do you have a pitch deck or similar ask for funding/help/advisors?


Thanks so much for your comment! You're right that this annotation tool can be used on any kind of free-form documents found online. I tackled Wikipedia first because it was an obvious first choice and it has an API to read the HTML. This could be opened up to other sources of data, but I also do not want this to become a scraping tool, so we would need to weigh the costs/benefits of adding new data sources. The additional cost of adding a new source is mostly about how difficult it is to read and parse the content. In the future, I could integrate with some paid sources (e.g. news publications), where people would pay for the content they scrape & label.

I have a pitch deck and I'm looking for all the things you mentioned :-). I can send the pitch deck to anyone interested.


Similar question: the CSV contains a collection of Wiki URLs, so why not allow URLs of any website that has "free-form documents"?


The only issue here is what I mentioned in the comment above: how easy is it to read and parse the content of said website, and is it legal to read the content programmatically? Do you have any website(s) in mind?
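One way to enforce that restriction on the input side is to whitelist domains that are known to be parseable. A hypothetical filter (not part of DataQA) might look like:

```python
from urllib.parse import urlparse

# Hypothetical helper: accept only URLs hosted on a Wikipedia domain,
# since arbitrary websites need bespoke parsing (and a terms-of-use check).
def is_wikipedia_url(url: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")
```

The explicit `.wikipedia.org` suffix check avoids matching lookalike hosts such as `notwikipedia.org`.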


How do the results from your tool compare to Wikidata?


This is to build your own knowledge base. In many cases, Wikidata might not have the data you're looking for. For example, in the tutorial I linked, the task is to come up with all the products released by a list of companies. Toutiao would be a product of ByteDance. This is a relation that might not exist on Wikidata (I tried to search for it but could not find it: https://www.wikidata.org/wiki/Q24835387).


I added ByteDance as the creator and owner of Toutiao. (It was already listed as a "product or material produced" on the ByteDance page https://wikidata.org/wiki/Q55606242 )


Wikidata is a very complete knowledge base, but I think there is still room for a tool like this to be used on Wikipedia data. There might be information missing from Wikidata that is still found on Wikipedia (e.g. the list of ByteDance products or investors is incomplete, and the number of employees is missing from Wikidata). This tool can be used to uncover these relationships for your application, or to feed them back into Wikidata if they are of public interest.

It could also be that you are trying to extract data to train a named entity recognition model. In that case, you want to extract the paragraph or sentence that has the information and the label.


Why not use that to enrich Wikidata?


You can add relationships to Wikidata. Something like "is a product of" probably already has a property, and would be well within the scope of Wikidata.
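As an illustration, a quick way to check whether Wikidata already records such a relation is a SPARQL query against its public endpoint. This sketch uses P1056 ("product or material produced", the property mentioned above) and the ByteDance QID Q55606242 from this thread; treat the IDs and query shape as illustrative rather than definitive:

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def products_query(entity_qid: str) -> str:
    # SPARQL: list everything the entity links to via P1056
    # ("product or material produced"), with English labels.
    return (
        "SELECT ?product ?productLabel WHERE { "
        f"wd:{entity_qid} wdt:P1056 ?product . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } '
        "}"
    )

def fetch_products(entity_qid: str) -> list[str]:
    # Query the public endpoint; Wikidata asks clients to set a User-Agent.
    url = WIKIDATA_SPARQL + "?" + urllib.parse.urlencode(
        {"query": products_query(entity_qid), "format": "json"}
    )
    req = urllib.request.Request(url, headers={"User-Agent": "dataqa-example/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        rows = json.load(resp)["results"]["bindings"]
    return [row["productLabel"]["value"] for row in rows]
```

Calling `fetch_products("Q55606242")` would return the labels of ByteDance's recorded products, Toutiao among them if the statement exists.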


This is really cool and reminds me of the Microsoft tool PICL (https://www.microsoft.com/en-us/research/video/machine-teach...). I would love to see a video demo of the product.


Hi! Thanks for your feedback. I had not come across this, but it looks quite similar :) (at least conceptually). I don't have a video, but there is a short gif in the repository. I am planning to make a video at some point though!


This seems to coincide with OpenAI's recent update giving GPT-3 the ability to reference links from searching: "A new version of GPT-3 that can use a web browser to more accurately answer questions" https://t.co/bzaaP9XnZm



