Data Services

With the rapid growth of technology in recent years, advanced and real-time computation has become an essential requirement in a wide range of applications. High-performance computing systems, with their powerful processing resources, make it possible to carry out complex computations at very high speed. One of the most important uses of these systems is in leading areas of artificial intelligence such as machine learning and deep learning, data mining, machine vision, and signal processing. In such workloads, a learning model is usually first implemented in a basic form and then trained on suitable, relevant input data. The quality and quantity of the data used during training directly affect the model's accuracy and its error rate. The collection presented here comprises important and well-known machine learning datasets, gathered in a total volume of about 6 terabytes. The data include images, video, audio, and text, and are made available as compressed files, entirely free of charge, to users of the Sharif high-performance computing system.

Name	Dataset URL	Size	Usage/Description	Type	GitHub sample code	Data path
HowTo100M	hpc.sharif.edu/test_2:/share/apps/data/howto100m	~1.4 TB	A large-scale dataset of narrated videos with an emphasis on instructional videos, in which content creators teach complex tasks with an explicit intention of explaining the visual content on screen.	Video	https://github.com/antoine77340/video_feature_extractor	hpc.sharif.edu/test_2: /share/apps/data
COCO	hpc.sharif.edu/test_2:/share/apps/data/COCO	72 GB	A large-scale object detection, segmentation, and captioning dataset. Features: object segmentation; recognition in context; superpixel stuff segmentation; 330K images (>200K labeled); 1.5 million object instances; 80 object categories; 91 stuff categories; 5 captions per image; 250,000 people with keypoints.	Image	https://github.com/litinglin/swintrack	hpc.sharif.edu/test_2: /share/apps/data
MPII Human Pose	hpc.sharif.edu/test_2:/share/apps/data/mpii_human_pose	~450 GB	Includes around 25K images containing over 40K people with annotated body joints. The images were systematically collected using an established taxonomy of everyday human activities. Overall, the dataset covers 410 human activities, and each image is provided with an activity label. Each image was extracted from a YouTube video and is provided with preceding and following un-annotated frames. In addition, the test set has richer annotations, including body part occlusions and 3D torso and head orientations.	Image - Video	https://github.com/salinasJJ/Bbpose	hpc.sharif.edu/test_2: /share/apps/data
MIT Places Images	hpc.sharif.edu/test_2:/share/apps/data/mit_places_audio_images	1.6 TB	A scene-centric database called Places, with 205 scene categories and 2.5 million images, each with a category label.	Image	https://github.com/wnhsu/ResDAVEnet-VQ	hpc.sharif.edu/test_2: /share/apps/data
MIT Places Audio 400K	hpc.sharif.edu/test_2:/share/apps/data/mit_places_audio_images	85 GB	Speech corpora collected to investigate the learning of spoken language (words, sub-word units, higher-level semantics, etc.) from visually grounded speech.	Audio	https://github.com/wnhsu/ResDAVEnet-VQ	hpc.sharif.edu/test_2: /share/apps/data
Spoken COCO 600K	hpc.sharif.edu/test_2:/share/apps/data/Object-Net	64 GB	SpokenCOCO (English) 600k contains approximately 600,000 recordings of human speakers reading the MSCOCO image captions out loud in English. Each MSCOCO caption is read once.	Audio	https://github.com/bhigy/zr-2021vg_baseline	hpc.sharif.edu/test_2: /share/apps/data
Spoken ObjectNet	hpc.sharif.edu/test_2:/share/apps/data/Object-Net	220 GB	A corpus of 50,273 spoken English audio captions for the images in the ObjectNet dataset.	Image - Audio	https://github.com/iapalm/Spoken-ObjectNet	hpc.sharif.edu/test_2: /share/apps/data
Zero Speech	hpc.sharif.edu/test_2:/share/apps/data/zero_speech	125 GB	The ultimate goal of the “Zero Resource Speech Challenge” is to construct a system that learns an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only the raw sensory information available to an early language learner.	Text - Audio	https://github.com/bhigy/zr-2021vg_baseline	hpc.sharif.edu/test_2: /share/apps/data
LIReC	hpc.sharif.edu/test_2:/share/apps/data/lirec	80 GB	Learning Interactions and Relationships between Movie Characters	Video captions	https://github.com/Annusha/LIReC	hpc.sharif.edu/test_2: /share/apps/data
UCF101	hpc.sharif.edu/test_2:/share/apps/data/UCF101	31 GB	Action Recognition Data Set	Video	https://github.com/SwinTransformer/Video-Swin-Transformer	hpc.sharif.edu/test_2: /share/apps/data
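
Since the datasets in the table above are provided as compressed files under the shared data path, a job will usually need to locate the archives for a dataset and unpack them into a working directory first. The following is a minimal sketch of that step, not an official tool of the center. The helper names `find_archives` and `extract_all`, the list of archive suffixes, and the `~/scratch` target directory are all assumptions made for illustration; the actual archive formats and directory layout on the cluster are not stated here and should be checked first.

```python
from __future__ import annotations

import shutil
from pathlib import Path

# Shared dataset root, taken from the "Data path" column of the table.
DATA_ROOT = Path("/share/apps/data")

# Suffixes treated as archives. This list is an assumption: the page only
# says the data are stored as compressed files, not in which format.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz")


def find_archives(dataset_dir: Path) -> list[Path]:
    """Return all archive files found anywhere under dataset_dir, sorted by path."""
    return sorted(
        p
        for p in dataset_dir.rglob("*")
        if p.is_file() and p.name.lower().endswith(ARCHIVE_SUFFIXES)
    )


def extract_all(dataset_dir: Path, target_dir: Path) -> list[Path]:
    """Unpack every archive under dataset_dir into target_dir/<archive base name>."""
    extracted = []
    for archive in find_archives(dataset_dir):
        # "part1.tar.gz" -> "part1"; archives sharing a base name share a folder.
        out_dir = target_dir / archive.name.split(".")[0]
        out_dir.mkdir(parents=True, exist_ok=True)
        shutil.unpack_archive(str(archive), str(out_dir))
        extracted.append(out_dir)
    return extracted


if __name__ == "__main__":
    dataset = DATA_ROOT / "UCF101"  # directory name from the table
    scratch = Path.home() / "scratch" / "UCF101"  # hypothetical working directory
    if dataset.is_dir():
        for archive in find_archives(dataset):
            print(archive)
        # Uncomment once the listing looks right and there is enough free space:
        # extract_all(dataset, scratch)
    else:
        print(f"{dataset} not found; run this on a node that mounts {DATA_ROOT}")
```

Listing before extracting is deliberate: several of these datasets are hundreds of gigabytes, so it is worth confirming what is there and how much space is available. Formats that `shutil.unpack_archive` does not handle, such as `.rar`, would need an external tool.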

Sharif University of Technology High Performance Computing Center