Welcome to the Child Language Corpus of Jordanian Arabic (JA), the first large-scale, systematically compiled linguistic resource dedicated to documenting the spoken language of typically developing children in Jordan. This corpus represents a foundational step in Arabic language acquisition research, offering a rich and unprecedented dataset of natural child speech across regional, age, and gender lines.
The corpus comprises approximately 500,000 words drawn from over 500 recorded interviews with children aged 2 years and 6 months to 12 years. These interactions capture a diverse spectrum of everyday, spontaneous language use, reflecting the authentic voices of Jordanian children across urban, rural, and Bedouin communities. The corpus offers an inclusive and highly representative view of vernacular JA in real-life contexts.

Each interview was carefully transcribed to mirror exactly how children pronounce words, preserving phonetic details and including markers for pauses and disfluencies. This attention to detail ensures that the corpus is not only a record of what children say but also how they say it—a critical resource for research in phonology, morphosyntax, discourse development, and beyond.

Many of the recorded sessions, especially those involving younger children, feature interactions with parents or caregivers, providing a naturalistic context for child-directed speech. These interactions offer valuable insights into turn-taking, scaffolding, and social-pragmatic development in early language acquisition.

Ethical standards were upheld throughout the project. Informed consent was obtained from all participating families, and data was anonymized to protect the privacy of the children and their families. The project was reviewed and approved in accordance with institutional ethical guidelines.

Key Features of the Corpus

Age Range: Children aged 2 years and 6 months to 12 years, covering key stages of early and late language development
Regional Diversity: Includes data from urban centers, rural areas, and Bedouin communities across Jordan
Search and Filter Options: Users can filter data based on region, age, and sex, enabling targeted investigations
Phonetic Transcription: Utterances are transcribed exactly as pronounced, maintaining critical phonological data
Context-Rich Interactions: Many interviews include natural caregiver-child dialogues, ideal for pragmatic and discourse studies
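
For readers who work with interview metadata offline, the region, age, and sex filters listed above can be approximated with a short Python sketch. This is an illustration only, not the corpus's actual search interface or export format: the `Interview` record, its field names, the region and sex labels, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Interview:
    """Hypothetical metadata record; field names are assumptions, not the corpus schema."""
    interview_id: str
    region: str      # assumed labels: "urban", "rural", "bedouin"
    age_months: int  # age stored in months so that 2 years 6 months is unambiguous
    sex: str         # assumed labels: "F", "M"


def age_in_months(years: int, months: int = 0) -> int:
    """Convert an age given in years and months to whole months."""
    return years * 12 + months


def filter_interviews(
    interviews: Iterable[Interview],
    region: Optional[str] = None,
    min_age: Optional[int] = None,
    max_age: Optional[int] = None,
    sex: Optional[str] = None,
) -> List[Interview]:
    """Keep interviews matching every criterion that is not None (ages in months, inclusive)."""
    selected = []
    for item in interviews:
        if region is not None and item.region != region:
            continue
        if min_age is not None and item.age_months < min_age:
            continue
        if max_age is not None and item.age_months > max_age:
            continue
        if sex is not None and item.sex != sex:
            continue
        selected.append(item)
    return selected


# Invented placeholder records, used only to exercise the filter.
SAMPLE = [
    Interview("demo-001", "urban", age_in_months(2, 6), "F"),
    Interview("demo-002", "rural", age_in_months(4, 6), "M"),
    Interview("demo-003", "bedouin", age_in_months(8), "F"),
    Interview("demo-004", "rural", age_in_months(12), "F"),
]

if __name__ == "__main__":
    # Rural children up to 6 years old.
    for match in filter_interviews(SAMPLE, region="rural", max_age=age_in_months(6)):
        print(match.interview_id)
```

Storing age in months avoids the ambiguity of decimal notation, where "2.6" could be read as either 2 years 6 months or 2.6 years.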

Why This Corpus Matters

This project is the first of its kind in the Arab world—a systematic, large-scale corpus that centers the voices of children in their everyday linguistic environments. Despite the centrality of Arabic in global linguistic diversity, child language acquisition in Arabic dialects has long been underrepresented in corpus-based research. This gap has limited our understanding of the developmental pathways unique to Arabic and of how children acquire complex syntactic and phonological features found in Arabic varieties.

The Child Language Corpus of Jordanian Arabic directly addresses this gap. It offers a robust empirical foundation for testing hypotheses in language development, morphosyntactic theory, dialect variation, and pragmatic competence. It also opens the door to cross-linguistic comparisons, shedding light on universal vs. language-specific features of acquisition.

Researchers in fields as diverse as developmental linguistics, language education, psycholinguistics, dialectology, speech-language pathology, and natural language processing (NLP) will find this corpus an invaluable tool. Moreover, the inclusion of vernacular, spoken Arabic—rather than Modern Standard Arabic—reflects a more accurate linguistic reality of children's day-to-day experiences and enhances the ecological validity of the research.

Looking Ahead

This corpus is more than a collection of interviews—it is a platform for collaboration, discovery, and innovation. As the database continues to expand and evolve, we welcome scholars and educators to explore, analyze, and build upon this resource. Together, we can deepen our understanding of how children acquire language in context and ensure that the linguistic experiences of Arabic-speaking children are fully represented in global academic discourse.

We hope this resource contributes meaningfully to the development of child language research in Arabic and inspires similar projects across the region and beyond.

Dr. Marwan Jarrah
Principal Investigator
Child Language Corpus of Jordanian Arabic
University of Jordan