Data Services

One of the most important applications of high-performance computing is cutting-edge work in artificial intelligence. In such workloads, a learning model is typically implemented in a basic form first and then trained on suitable, relevant input data. This collection gathers important and well-known machine learning datasets, about 6 TB in total. The data include images, video, audio, and text, and are provided as compressed files, completely free of charge, to users of the Sharif High-Performance Computing (HPC) system.

Name	Dataset path	Size	Description	Type	GitHub sample code
COCO	share/apps/data/COCO/	72 GB	A large-scale object detection, segmentation, and captioning dataset. Features: object segmentation; recognition in context; superpixel stuff segmentation; 330K images (>200K labeled); 1.5 million object instances; 80 object categories; 91 stuff categories; 5 captions per image; 250,000 people with keypoints.	Image	https://github.com/litinglin/swintrack
HowTo100M	share/apps/date/howto100m/	1.4 TB	A large-scale dataset of narrated videos, with an emphasis on instructional videos in which content creators teach complex tasks with the explicit intention of explaining the visual content on screen.	Video	https://github.com/antoine77340/video_feature_extractor
LIReC	share/apps/data/lirec/	80 GB	Learning Interactions and Relationships between Movie Characters.	Video captions	https://github.com/Annusha/LIReC
MIT Places Audio 400K	share/apps/data/mit_places_audio_images/	85 GB	Speech corpora collected to investigate the learning of spoken language (words, sub-word units, higher-level semantics, etc.) from visually grounded speech.	Audio	https://github.com/wnhsu/ResDAVEnet-VQ
MIT Places Images	share/apps/data/mit_places_audio_images/	1.6 TB	A scene-centric database called Places, with 205 scene categories and 2.5 million images, each with a category label.	Image	https://github.com/wnhsu/ResDAVEnet-VQ
MPII Human Pose	share/apps/data/mpii_human_pose/	450 GB	Around 25K images containing over 40K people with annotated body joints. The images were systematically collected using an established taxonomy of everyday human activities. Overall, the dataset covers 410 human activities, and each image is provided with an activity label. Each image was extracted from a YouTube video and is provided with preceding and following un-annotated frames. In addition, the test set has richer annotations, including body part occlusions and 3D torso and head orientations.	Image, Video	https://github.com/salinasJJ/Bbpose
Spoken COCO 600K	share/apps/data/Object-Net/	64 GB	SpokenCOCO (English) 600k contains approximately 600,000 recordings of human speakers reading the MSCOCO image captions out loud (in English). Each MSCOCO caption is read once.	Audio	https://github.com/bhigy/zr-2021vg_baseline
Spoken ObjectNet	share/apps/data/Object-Net/	220 GB	A corpus of 50,273 English spoken audio captions for the images in the ObjectNet dataset.	Image, Audio	https://github.com/iapalm/Spoken-ObjectNet
UCF101	share/apps/data/UCF101/	31 GB	Action recognition dataset.	Video	https://github.com/SwinTransformer/Video-Swin-Transformer
Zero Speech	share/apps/data/zero_speech/	125 GB	The ultimate goal of the "Zero Resource Speech Challenge" is to construct a system that learns an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only raw sensory information available to an early language learner.	Text, Audio	https://github.com/bhigy/zr-2021vg_baseline

Learn more about Data Services

In these workloads, a learning model is generally implemented in a basic form first and then trained on suitable, relevant input data. The quality and quantity of the data used during training directly affect the model's accuracy and its error level. The collection presented here comprises important and well-known machine learning datasets, about 6 TB in total. The data include images, video, audio, and text, and are provided as compressed files, completely free of charge, to users of the Sharif High-Performance Computing (HPC) system.

How to use the datasets

When you run a compute job on the scheduling cluster, you can use the dataset files by referring to the path where the dataset is stored. Note that your access to these datasets is read-only. For example, the access path for the COCO dataset is:

/share/apps/data/COCO
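
Because the datasets are delivered as compressed files and the shared path is read-only, a job will usually need to unpack what it uses into a directory it can write to. The following Python sketch shows one way to do that under stated assumptions: the archive formats are not specified on this page, so the code only picks up formats that Python's standard `shutil.unpack_archive` supports, and the helper names `find_archives` and `extract_archive` are illustrative, not part of any cluster tooling.

```python
import shutil
from pathlib import Path

# Path quoted in the text above; access to it is read-only.
COCO_DIR = Path("/share/apps/data/COCO")


def supported_suffixes():
    """Archive extensions that shutil.unpack_archive can handle on this system."""
    return tuple(ext for _, exts, _ in shutil.get_unpack_formats() for ext in exts)


def find_archives(dataset_dir):
    """Return the archive files found under dataset_dir, sorted by path."""
    suffixes = supported_suffixes()
    return sorted(
        p for p in Path(dataset_dir).rglob("*")
        if p.is_file() and p.name.lower().endswith(suffixes)
    )


def extract_archive(archive, work_dir):
    """Unpack one archive into a writable directory and return that directory."""
    # Use the part of the file name before the first dot as the folder name.
    target = Path(work_dir) / Path(archive).name.split(".")[0]
    target.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target))
    return target


if __name__ == "__main__":
    # Only list the archives here; extraction can take a long time and
    # needs enough free space in a directory you own.
    if COCO_DIR.is_dir():
        for archive in find_archives(COCO_DIR):
            print(archive, archive.stat().st_size, "bytes")
    else:
        print(COCO_DIR, "is not visible from this machine")
```

To unpack a single file, call `extract_archive(archive, work_dir)` with `work_dir` set to a writable location of your own, such as a scratch or home directory. The shared dataset directory itself cannot be used as the destination.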
