Pedro Ortiz Suarez
Other names: Pedro Javier Ortiz Suárez
Senior Research Scientist, Common Crawl Foundation
Verified email at commoncrawl.org
Title · Cited by · Year
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow, R Castagné, ...
Cited by: 1355 · Year: 2023
CamemBERT: a Tasty French Language Model
L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, ...
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 1146 · Year: 2020
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
PJ Ortiz Suárez, B Sagot, L Romary
7th Workshop on the Challenges in the Management of Large Corpora, 2019
Cited by: 435* · Year: 2019
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
PJ Ortiz Suárez, L Romary, B Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 219* · Year: 2020
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
J Kreutzer, I Caswell, L Wang, A Wahab, D van Esch, N Ulzii-Orshikh, ...
Transactions of the Association for Computational Linguistics 10, 50-72, 2022
Cited by: 203* · Year: 2022
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
J Abadji, P Ortiz Suarez, L Romary, B Sagot
arXiv preprint arXiv:2201.06642, 2022
Cited by: 141 · Year: 2022
The bigscience roots corpus: A 1.6 tb composite multilingual dataset
H Laurençon, L Saulnier, T Wang, C Akiki, A Villanova del Moral, ...
Advances in Neural Information Processing Systems 35, 31809-31826, 2022
Cited by: 133 · Year: 2022
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
J Abadji, PJO Suárez, L Romary, B Sagot
CMLC 2021-9th Workshop on Challenges in the Management of Large Corpora, 2021
Cited by: 55 · Year: 2021
Building a user-generated content north-african arabizi treebank: Tackling hell
D Seddah, F Essaidi, A Fethi, M Futeral, B Muller, PJ Ortiz Suárez, ...
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 46 · Year: 2020
Quality at a glance: An audit of web-crawled multilingual datasets
I Caswell, J Kreutzer, L Wang, A Wahab, D van Esch, N Ulzii-Orshikh, ...
arXiv e-prints, arXiv: 2103.12028, 2021
Cited by: 33 · Year: 2021
Establishing a New State-of-the-Art for French Named Entity Recognition
PJ Ortiz Suárez, Y Dupont, B Muller, L Romary, B Sagot
Proceedings of The 12th Language Resources and Evaluation Conference, 4631–4638, 2020
Cited by: 24* · Year: 2020
From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French
S Gabay, P Ortiz Suarez, A Bartz, A Chagué, R Bawden, P Gambette, ...
arXiv preprint arXiv:2202.09452, 2022
Cited by: 16 · Year: 2022
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
A McMillan-Major, Z Alyafeai, S Biderman, K Chen, F De Toni, G Dupont, ...
arXiv preprint arXiv:2201.10066, 2022
Cited by: 13 · Year: 2022
Les modèles de langue contextuels Camembert pour le français: impact de la taille et de l'hétérogénéité des données d'entrainement
L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, E Clergerie, ...
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP …, 2020
Cited by: 11 · Year: 2020
Automatic extraction of materials and properties from superconductors scientific literature
L Foppiano, PB Castro, P Ortiz Suarez, K Terashima, Y Takano, M Ishii
Science and Technology of Advanced Materials: Methods 3 (1), 2153633, 2023
Cited by: 10 · Year: 2023
Perplexed by quality: A perplexity-based method for adult and harmful content detection in multilingual heterogeneous web data
T Jansen, Y Tong, V Zevallos, PO Suarez
arXiv preprint arXiv:2212.10440, 2022
Cited by: 10 · Year: 2022
Bertrade: Using contextual embeddings to parse old french
L Grobol, M Regnault, PO Suarez, B Sagot, L Romary, B Crabbé
13th Language Resources and Evaluation Conference, 2022
Cited by: 8 · Year: 2022
Tokenizer Choice For LLM Training: Negligible or Crucial?
M Ali, M Fromm, K Thellmann, R Rutmann, M Lübbering, J Leveling, ...
arXiv preprint arXiv:2310.08754, 2023
Cited by: 7 · Year: 2023
SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers
PJ Ortiz Suárez, Y Dupont, G Lejeune, T Tian
CLEF 2020 Working Notes 2696, 2020
Cited by: 6* · Year: 2020
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
M Popa-Fabre, PJ Ortiz Suárez, B Sagot, ÉV de la Clergerie
Proceedings of the 8th Workshop on Challenges in the Management of Large …, 2020
Cited by: 3 · Year: 2020