VinBigdata shares 100-hour speech dataset with the community

To help build a scientific playground for the Speech and Language Processing community in Vietnam, Vingroup Big Data Institute (VinBigdata) has shared two Vietnamese datasets, supporting VLSP in organizing its 2020 ASR challenge.

One of the two datasets shared by VinBigdata is the speech corpus for the automatic speech recognition task in VLSP-2020. This small training set of about 100 hours, named VinBigdata-VLSP2020-100h, was created specifically for Task-01 of the international workshop VLSP-2020. Utterances are stored as audio files in WAV format, with text files containing the corresponding transcripts. The dataset covers two speech styles. The first is read speech (about 20 hours): speakers read manually prepared transcripts using their smartphones in many environments. The transcript topics were news, stories, wiki, etc. The second is spontaneous speech (about 80 hours), which was crawled from open sources and manually transcribed with an accuracy of 96%. You can download the speech corpus here.
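As a rough illustration of how a corpus laid out this way might be read, the sketch below pairs each WAV file with its transcript and sums the audio duration. The directory layout (`utt001.wav` next to `utt001.txt`) and the function names `load_utterances` and `total_hours` are assumptions made for this example; the actual file naming in VinBigdata-VLSP2020-100h may differ.

```python
import wave
from pathlib import Path


def load_utterances(corpus_dir):
    """Pair each .wav file with the .txt transcript that shares its stem.

    Assumed (hypothetical) layout: utt001.wav sits next to utt001.txt.
    Returns a list of (wav_path, transcript, duration_seconds) tuples.
    """
    utterances = []
    for wav_path in sorted(Path(corpus_dir).glob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.exists():
            continue  # skip audio that has no transcript
        transcript = txt_path.read_text(encoding="utf-8").strip()
        with wave.open(str(wav_path), "rb") as wav_file:
            # duration = number of frames / frames per second
            duration = wav_file.getnframes() / wav_file.getframerate()
        utterances.append((wav_path, transcript, duration))
    return utterances


def total_hours(utterances):
    """Sum utterance durations and convert seconds to hours."""
    return sum(duration for _, _, duration in utterances) / 3600.0
```

Calling `total_hours(load_utterances(path))` on the read-speech and spontaneous-speech portions separately would be one way to check the approximate 20-hour and 80-hour figures quoted above.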

The other shared dataset is for English-Vietnamese Machine Translation. The Machine Translation shared task includes only one track: text translation from English to Vietnamese in the NEWS domain. The training data consists of two kinds of corpora. The parallel corpora are UTF-8 plaintext, 1-to-1 sentence aligned, with one sentence per line. They include an in-domain NEWS dataset of 20k samples (80% in the training set, 10% in the dev set and 10% in the test set) and out-of-domain parallel datasets of roughly 4M samples in total, such as openSub (3.5M), ted-like (55k), evbcorpus (45k), wiki-alt (20k), and basic (8.8k). The monolingual corpora are also UTF-8 plaintext, with one “sentence” per line, and include 2M Vietnamese samples crawled from the web. The parallel corpora are now available here. You can also download the monolingual corpora here.
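The parallel format described above (line i of the English file corresponds to line i of the Vietnamese file) can be read with a few lines of Python. The sketch below is illustrative only: the function names `read_parallel` and `split_80_10_10` are hypothetical, and the released NEWS data may already ship with its own train/dev/test files, in which case the split function merely shows what the 80/10/10 proportions mean.

```python
import random


def read_parallel(src_path, tgt_path):
    """Read a 1-to-1 sentence-aligned corpus: line i of each file is a pair."""
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        src_lines = [line.rstrip("\n") for line in src]
        tgt_lines = [line.rstrip("\n") for line in tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("files are not 1-to-1 sentence aligned")
    return list(zip(src_lines, tgt_lines))


def split_80_10_10(pairs, seed=0):
    """Shuffle sentence pairs and split them 80% / 10% / 10%."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```

Shuffling whole pairs, rather than each side separately, keeps every English sentence attached to its Vietnamese translation. For a 20k-sample corpus these proportions give 16,000 training, 2,000 dev and 2,000 test pairs.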

This year, VLSP 2020 is expected to be held in December in Hanoi. Since 2012, the VLSP community has held annual activities to share the results of applied research, as well as tools and resources in the field of language processing, and to plan development strategies for the community. The annual seminars attract hundreds of participants, and nearly 5000 members have joined the VLSP community's Facebook forum.
