[01:03] *** vitzli has joined #wikiteam
[01:04] *** vitzli has quit IRC (Remote host closed the connection)
[01:19] *** Wingy has quit IRC (Read error: Operation timed out)
[01:22] *** chfoo has quit IRC (Read error: Operation timed out)
[01:22] *** luckcolor has quit IRC (se.hub efnet.portlane.se)
[01:25] *** chfoo has joined #wikiteam
[01:28] *** MrRadar2 has quit IRC (Read error: Operation timed out)
[01:29] *** kiska has quit IRC (Read error: Connection reset by peer)
[01:29] *** systwi has quit IRC (Ping timeout: 622 seconds)
[01:30] *** systwi has joined #wikiteam
[01:31] *** balrog has quit IRC (Read error: Operation timed out)
[01:31] *** astrid has quit IRC (Read error: Operation timed out)
[01:31] *** Igloo has quit IRC (Read error: Operation timed out)
[01:32] *** Iglooop1 has quit IRC (Read error: Operation timed out)
[01:32] *** balrog has joined #wikiteam
[01:32] *** Flashfire has quit IRC (Ping timeout: 276 seconds)
[01:32] *** Zerote_ has quit IRC (Ping timeout: 276 seconds)
[01:32] *** Zerote has joined #wikiteam
[01:32] *** MrRadar has quit IRC (Read error: Operation timed out)
[01:32] *** arkiver has quit IRC (Read error: Operation timed out)
[01:33] *** VADemon has joined #wikiteam
[01:33] *** benjins has quit IRC (Read error: Operation timed out)
[01:33] *** chfoo has quit IRC (Read error: Operation timed out)
[01:33] *** balrog has quit IRC (Remote host closed the connection)
[01:34] *** arkiver has joined #wikiteam
[01:34] *** balrog has joined #wikiteam
[01:35] *** svchfoo1 sets mode: +o arkiver
[01:35] *** svchfoo3 sets mode: +o arkiver
[01:36] *** atphoenix has quit IRC (Read error: Operation timed out)
[01:37] *** VADemon_ has quit IRC (Read error: Operation timed out)
[01:37] *** systwi_ has joined #wikiteam
[01:38] *** systwi_ has quit IRC (Read error: Connection reset by peer)
[01:41] *** systwi has quit IRC (Ping timeout: 622 seconds)
[01:42] *** Zerote has quit IRC (Ping timeout: 622 seconds)
[01:44] *** systwi has joined #wikiteam
[01:48] *** systwi_ has joined #wikiteam
[01:54] *** systwi has quit IRC (Ping timeout: 622 seconds)
[01:59] *** systwi_ has quit IRC (Ping timeout: 622 seconds)
[02:10] *** systwi_ has joined #wikiteam
[02:12] *** systwi_ has quit IRC (Read error: Connection reset by peer)
[02:13] *** systwi_ has joined #wikiteam
[02:14] *** MrRadar2 has joined #wikiteam
[02:14] *** Igloo has joined #wikiteam
[02:15] *** chfoo has joined #wikiteam
[02:15] *** systwi_ has quit IRC (Read error: Connection reset by peer)
[02:15] *** MrRadar has joined #wikiteam
[02:16] *** systwi_ has joined #wikiteam
[02:16] *** Iglooop1 has joined #wikiteam
[02:17] *** svchfoo1 sets mode: +o Iglooop1
[02:17] *** svchfoo3 sets mode: +o Iglooop1
[02:18] *** MrRadar has quit IRC (Write error: Broken pipe)
[02:18] *** Igloo has quit IRC (Read error: Operation timed out)
[02:18] *** Iglooop1 has quit IRC (Read error: Operation timed out)
[02:19] *** chfoo has quit IRC (Read error: Operation timed out)
[02:24] *** yano_ is now known as yano
[02:26] *** chfoo has joined #wikiteam
[02:38] *** Iglooop1 has joined #wikiteam
[02:39] *** svchfoo1 sets mode: +o Iglooop1
[02:39] *** svchfoo3 sets mode: +o Iglooop1
[02:39] *** astrid has joined #wikiteam
[02:40] *** Iglooop1 sets mode: +o astrid
[02:41] *** systwi_ has quit IRC (Ping timeout: 622 seconds)
[02:41] *** Igloo has joined #wikiteam
[02:42] *** chfoo has quit IRC (Ping timeout: 622 seconds)
[02:42] *** MrRadar has joined #wikiteam
[03:04] *** systwi has joined #wikiteam
[03:08] *** systwi_ has joined #wikiteam
[03:14] *** systwi has quit IRC (Ping timeout: 622 seconds)
[03:17] *** systwi has joined #wikiteam
[03:19] *** systwi_ has quit IRC (Ping timeout: 622 seconds)
[03:24] *** systwi_ has joined #wikiteam
[03:29] *** systwi has quit IRC (Ping timeout: 622 seconds)
[03:35] *** systwi__ has joined #wikiteam
[03:36] *** systwi_ has quit IRC (Ping timeout: 622 seconds)
[05:39] *** kiska has joined #wikiteam
[05:39] *** Iglooop1 sets mode: +o kiska
[06:19] *** chfoo has joined #wikiteam
[06:25] *** systwi__ is now known as systwi
[12:36] *** astrid has quit IRC (Read error: Operation timed out)
[12:44] *** astrid has joined #wikiteam
[12:44] *** Iglooop1 sets mode: +o astrid
[16:43] *** kiska18 has quit IRC (Remote host closed the connection)
[16:44] *** kiska18 has joined #wikiteam
[16:44] *** Iglooop1 sets mode: +o kiska18
[17:25] *** VADemon has quit IRC (Quit: left4dead)
[17:46] Nemo_bis: I just noticed that https://archive.org/details/wikimediacommons?sort=-publicdate hasn't been updated since 2016. How come? Is there another dataset of the Commons data that I missed?
[18:24] JAA: no
[18:27] I just got sick of doing it without some proper hardware; the last time, I did it with some 4 TB disks over a Gigabit connection at my university office
[18:28] *** Wingy has joined #wikiteam
[18:29] Ah
[18:31] Sounds like better tooling is needed to avoid keeping a full copy on local storage?
[18:32] not really, unless we want to change format
[18:32] there is some merit in the simplicity of the daily ZIP files
[18:34] mostly, one could save a lot of time with better error handling
[18:34] Hmm, but the daily sizes are only a couple hundred GB, so why did you need multiple 4 TB disks?
[18:35] To increase concurrency and handle resumes
[18:35] Or did you not upload continuously?
[18:35] Otherwise it takes ages
[18:35] Hmm
[18:36] I usually uploaded batches of at least 6-12 months, at least 6 months after the latest
[18:36] Right
[18:36] Yeah, then it takes a large amount of data obviously.
[18:36] storage*
[18:37] With some server closer to both IA and a WMF upload cache, it might be easier to saturate a gigabit connection without concurrency
[18:37] Then the bottleneck would be the I/O speed (how quickly you can write and zip)
[18:38] From Europe the bottleneck is invariably networking
[18:38] Without concurrency, you can just write to ZIP directly I guess.
[18:39] (Or if ZIP doesn't support that, .tar.gz, which would be a minimal format change.)
[18:42] That would at least get rid of half the I/O.
[18:42] .tar.gz is not recommended, it would be unusable at those sizes without seeking
[18:42] you can write directly to ZIP, yes, but then you need to handle network failures
[18:43] if you download everything and then ZIP, you can just let wget handle the retries and continuation
[18:43] Well yeah, accessing it would be a pain, that's true.
[18:43] Right
[18:44] Is the code you used for this back then available somewhere?
[18:48] yes, it's all in our usual repo
[18:48] very basic https://github.com/WikiTeam/wikiteam/blob/master/wikimediacommons/commonsdownloader.py
[18:49] it's easy to get access to the Wikimedia DB replica, but if you want I can run https://github.com/WikiTeam/wikiteam/blob/master/wikimediacommons/commonssql.py for you
[18:54] Ah, of course it is. :-)
[18:59] That's this I assume? https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database
[19:18] *** Wingy has quit IRC (The Lounge - https://thelounge.chat)
[19:43] JAA: yes but you don't need to know any of that, it's just "sql commonswiki" once you log in https://wikitech.wikimedia.org/wiki/Help:MySQL_queries#Accessing_the_databases
[19:45] Ah, sweet.
[19:48] Sounds good. I don't have time to do anything about this anytime soon, but I guess that's all the information needed to resume that archival, in principle.
[21:28] *** Zerote has joined #wikiteam
[21:41] *** benjins has joined #wikiteam
[21:50] JAA: yes, so far most of the people involved were wikimedians so some information is either taken for granted or documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Dumps
[21:53] nothing is especially complicated, just tedious
[23:20] Nemo_bis: Right. I guess we should document it on the wikiteam page on AT wiki.
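
Editor's note on the [18:42]-[18:43] exchange: writing downloads straight into the ZIP saves the intermediate copy on disk, but retries then have to be handled by the script instead of wget. The following is only a minimal Python sketch of that idea and is not taken from commonsdownloader.py. The names `add_with_retries`, `build_zip` and the `fetch` callback are hypothetical, and the choice of `ZIP_STORED` assumes the media files are mostly compressed already. Each file is buffered in memory and only written once the download has succeeded, so a network failure never leaves a partial member in the archive.

```python
import io
import time
import zipfile
from typing import BinaryIO, Callable, Iterable, List, Union


def add_with_retries(
    archive: zipfile.ZipFile,
    name: str,
    fetch: Callable[[str], bytes],
    attempts: int = 3,
    delay: float = 0.0,
) -> bool:
    """Download one file completely, then write it to the ZIP in one step."""
    for attempt in range(1, attempts + 1):
        try:
            # Hypothetical callback: returns the file body or raises OSError
            # (ConnectionError, TimeoutError and urllib's URLError all are).
            data = fetch(name)
        except OSError:
            if attempt < attempts:
                time.sleep(delay)
            continue
        # Only reached after a complete download, so no partial members.
        archive.writestr(name, data)
        return True
    return False


def build_zip(
    out: Union[str, BinaryIO],
    names: Iterable[str],
    fetch: Callable[[str], bytes],
    attempts: int = 3,
) -> List[str]:
    """Write all files into one ZIP; return the names that kept failing."""
    failed = []
    # Assumption: ZIP_STORED, since recompressing media gains little.
    with zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as archive:
        for name in names:
            if not add_with_retries(archive, name, fetch, attempts):
                failed.append(name)
    return failed
```

The list returned by `build_zip` can be fed into a later run, which is roughly the resume behaviour that wget provides in the download-then-ZIP approach. Buffering whole files in memory is a simplification that would need rethinking for very large media files.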