Tue, Oct 26, 21, datasets for Scikit-learn, public google and nlp projects with awesome-public-datasets, Open Images V6
This is a draft, the content is not complete and of poor quality!
Note: Resources for DS & ML & DL.
Articles
- Elite Data Science - Datasets for Data Science and Machine Learning
Create artificial dataset
- sklearn dataset module: `from sklearn import datasets`. This also contains some popular reference datasets.
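As a minimal sketch of creating an artificial dataset with the sklearn module mentioned above, the snippet below uses `make_classification` and `make_blobs`. The sample counts, feature counts, and `random_state` values are illustrative assumptions, not settings taken from this note.

```python
from sklearn.datasets import make_blobs, make_classification

# Synthetic binary classification problem (sizes are illustrative assumptions).
X, y = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=1,
    random_state=0,
)

# Three Gaussian clusters in 2-D, handy for trying out clustering code.
X_blobs, y_blobs = make_blobs(
    n_samples=150, centers=3, n_features=2, random_state=0
)

print(X.shape, y.shape)
print(X_blobs.shape, y_blobs.shape)
```

Fixing `random_state` keeps the generated data reproducible between runs, which is usually what you want when comparing models on the same toy data.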
Source of datasets
- awesome-public-datasets - A topic-centric list of HQ open datasets.
- Built-in datasets in Scikit-Learn.
- BuzzFeedNews/everything - data from BuzzFeed.
- COCO - Common Objects in Context.
- Data Hub Datasets collection - high quality data and datasets organized by topic.
- data.gov - a large dataset aggregator and the home of the US Government's open data.
- data.world - The Cloud-Native Data Catalog.
- FiveThirtyEight - hard data and statistical analysis to tell stories about politics, sports, societal matters and more.
- Google Dataset Search.
- Google Trends Datastore
- Google AI Datasets - In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
- Kaggle Datasets.
- NLP-progress.
- Open Images V6
- Quandl - a good choice for testing machine learning algorithms without spending time on cleaning data.
- r/datasets.
- Stanford Large Network Dataset Collection.
- UCI
- TensorFlow Datasets
- The Yahoo Webscope Program
- torchvision.datasets
- WHU-RS Datasets - Dataset Collection by Group of Photogrammetry and Computer Vision (GPCV) at Wuhan University.
Specific Datasets
- COCO Dataset - a large-scale object detection, segmentation, and captioning dataset.
- Dataset samples from Machine Learning Mastery.
- Fruit-Images-Dataset - A dataset of images containing fruits and vegetables.
- google-landmark - Dataset with 5 million images depicting human-made and natural landmarks spanning 200 thousand classes.
- ImageNet - ImageNet is an image database organized according to the WordNet hierarchy.
- Insight - BBC News Datasets
- Large-scale CelebFaces Attributes (CelebA) Dataset
- Large Movie Review Dataset (IMDB)
- MIT Places Database for Scene Recognition.
- Sarcasm detection dataset.
- UEA & UCR Time Series Classification Repository
- WordNet - A Lexical Database for English.
Vietnamese
- IWSLT'15 English-Vietnamese data (small from Stanford).
- NLP-progress - Vietnamese
- PhoBERT - Pre-trained language models for Vietnamese.
- PhoW2V (2020): Pre-trained Word2Vec syllable- and word-level embeddings for Vietnamese.
- ViText2SQL (EMNLP 2020 Findings): A dataset for Vietnamese Text2SQL semantic parsing.
- VnCoreNLP (NAACL 2018): A Vietnamese NLP pipeline of word (and sentence) segmentation, POS tagging, named entity recognition and dependency parsing.
Sample datasets
- Iris flower dataset (`from sklearn.datasets import load_iris`).
- Labeled Faces in the Wild Home (`from sklearn.datasets import fetch_lfw_people`).
- pydatafaker - A python package to create fake data with relationships between tables.
- The digits dataset (`sklearn.datasets.load_digits`).
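A short sketch of loading two of the bundled sample datasets listed above. It only uses `load_iris` and `load_digits`, which ship with scikit-learn; `fetch_lfw_people` is left out here because it downloads its data on first use.

```python
from sklearn.datasets import load_digits, load_iris

# Both loaders return a Bunch object with `data`, `target`, and metadata fields.
iris = load_iris()
digits = load_digits()

print(iris.data.shape)       # (150, 4): 150 flowers, 4 measurements each
print(iris.target_names)     # names of the three iris species
print(digits.data.shape)     # (1797, 64): flattened 8x8 images
print(digits.images.shape)   # (1797, 8, 8): the same images, unflattened
```

Passing `return_X_y=True` to either loader returns a plain `(X, y)` tuple instead, which is convenient when the metadata is not needed.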
Tools
- TimeSynth - A Multipurpose Library for Synthetic Time Series Generation in Python.