#wikiteam 2020-01-01,Wed

Time Nickname Message
01:03 🔗 vitzli has joined #wikiteam
01:04 🔗 vitzli has quit IRC (Remote host closed the connection)
01:19 🔗 Wingy has quit IRC (Read error: Operation timed out)
01:22 🔗 chfoo has quit IRC (Read error: Operation timed out)
01:22 🔗 luckcolor has quit IRC (se.hub efnet.portlane.se)
01:25 🔗 chfoo has joined #wikiteam
01:28 🔗 MrRadar2 has quit IRC (Read error: Operation timed out)
01:29 🔗 kiska has quit IRC (Read error: Connection reset by peer)
01:29 🔗 systwi has quit IRC (Ping timeout: 622 seconds)
01:30 🔗 systwi has joined #wikiteam
01:31 🔗 balrog has quit IRC (Read error: Operation timed out)
01:31 🔗 astrid has quit IRC (Read error: Operation timed out)
01:31 🔗 Igloo has quit IRC (Read error: Operation timed out)
01:32 🔗 Iglooop1 has quit IRC (Read error: Operation timed out)
01:32 🔗 balrog has joined #wikiteam
01:32 🔗 Flashfire has quit IRC (Ping timeout: 276 seconds)
01:32 🔗 Zerote_ has quit IRC (Ping timeout: 276 seconds)
01:32 🔗 Zerote has joined #wikiteam
01:32 🔗 MrRadar has quit IRC (Read error: Operation timed out)
01:32 🔗 arkiver has quit IRC (Read error: Operation timed out)
01:33 🔗 VADemon has joined #wikiteam
01:33 🔗 benjins has quit IRC (Read error: Operation timed out)
01:33 🔗 chfoo has quit IRC (Read error: Operation timed out)
01:33 🔗 balrog has quit IRC (Remote host closed the connection)
01:34 🔗 arkiver has joined #wikiteam
01:34 🔗 balrog has joined #wikiteam
01:35 🔗 svchfoo1 sets mode: +o arkiver
01:35 🔗 svchfoo3 sets mode: +o arkiver
01:36 🔗 atphoenix has quit IRC (Read error: Operation timed out)
01:37 🔗 VADemon_ has quit IRC (Read error: Operation timed out)
01:37 🔗 systwi_ has joined #wikiteam
01:38 🔗 systwi_ has quit IRC (Read error: Connection reset by peer)
01:41 🔗 systwi has quit IRC (Ping timeout: 622 seconds)
01:42 🔗 Zerote has quit IRC (Ping timeout: 622 seconds)
01:44 🔗 systwi has joined #wikiteam
01:48 🔗 systwi_ has joined #wikiteam
01:54 🔗 systwi has quit IRC (Ping timeout: 622 seconds)
01:59 🔗 systwi_ has quit IRC (Ping timeout: 622 seconds)
02:10 🔗 systwi_ has joined #wikiteam
02:12 🔗 systwi_ has quit IRC (Read error: Connection reset by peer)
02:13 🔗 systwi_ has joined #wikiteam
02:14 🔗 MrRadar2 has joined #wikiteam
02:14 🔗 Igloo has joined #wikiteam
02:15 🔗 chfoo has joined #wikiteam
02:15 🔗 systwi_ has quit IRC (Read error: Connection reset by peer)
02:15 🔗 MrRadar has joined #wikiteam
02:16 🔗 systwi_ has joined #wikiteam
02:16 🔗 Iglooop1 has joined #wikiteam
02:17 🔗 svchfoo1 sets mode: +o Iglooop1
02:17 🔗 svchfoo3 sets mode: +o Iglooop1
02:18 🔗 MrRadar has quit IRC (Write error: Broken pipe)
02:18 🔗 Igloo has quit IRC (Read error: Operation timed out)
02:18 🔗 Iglooop1 has quit IRC (Read error: Operation timed out)
02:19 🔗 chfoo has quit IRC (Read error: Operation timed out)
02:24 🔗 yano_ is now known as yano
02:26 🔗 chfoo has joined #wikiteam
02:38 🔗 Iglooop1 has joined #wikiteam
02:39 🔗 svchfoo1 sets mode: +o Iglooop1
02:39 🔗 svchfoo3 sets mode: +o Iglooop1
02:39 🔗 astrid has joined #wikiteam
02:40 🔗 Iglooop1 sets mode: +o astrid
02:41 🔗 systwi_ has quit IRC (Ping timeout: 622 seconds)
02:41 🔗 Igloo has joined #wikiteam
02:42 🔗 chfoo has quit IRC (Ping timeout: 622 seconds)
02:42 🔗 MrRadar has joined #wikiteam
03:04 🔗 systwi has joined #wikiteam
03:08 🔗 systwi_ has joined #wikiteam
03:14 🔗 systwi has quit IRC (Ping timeout: 622 seconds)
03:17 🔗 systwi has joined #wikiteam
03:19 🔗 systwi_ has quit IRC (Ping timeout: 622 seconds)
03:24 🔗 systwi_ has joined #wikiteam
03:29 🔗 systwi has quit IRC (Ping timeout: 622 seconds)
03:35 🔗 systwi__ has joined #wikiteam
03:36 🔗 systwi_ has quit IRC (Ping timeout: 622 seconds)
05:39 🔗 kiska has joined #wikiteam
05:39 🔗 Iglooop1 sets mode: +o kiska
06:19 🔗 chfoo has joined #wikiteam
06:25 🔗 systwi__ is now known as systwi
12:36 🔗 astrid has quit IRC (Read error: Operation timed out)
12:44 🔗 astrid has joined #wikiteam
12:44 🔗 Iglooop1 sets mode: +o astrid
16:43 🔗 kiska18 has quit IRC (Remote host closed the connection)
16:44 🔗 kiska18 has joined #wikiteam
16:44 🔗 Iglooop1 sets mode: +o kiska18
17:25 🔗 VADemon has quit IRC (Quit: left4dead)
17:46 🔗 JAA Nemo_bis: I just noticed that https://archive.org/details/wikimediacommons?sort=-publicdate hasn't been updated since 2016. How come? Is there another dataset of the Commons data that I missed?
18:24 🔗 Nemo_bis JAA: no
18:27 🔗 Nemo_bis I just got sick of doing it without some proper hardware; the last time I did it with some 4 TB disks over a Gigabit connection at my university office
18:28 🔗 Wingy has joined #wikiteam
18:29 🔗 JAA Ah
18:31 🔗 JAA Sounds like better tooling is needed to prevent keeping a full copy on local storage?
18:32 🔗 Nemo_bis not really unless we want to change format
18:32 🔗 Nemo_bis there is some merit in the simplicity of the daily ZIP files
18:34 🔗 Nemo_bis mostly, one could save a lot of time with better error handling
18:34 🔗 JAA Hmm, but the daily sizes are only a couple hundred GB, so why did you need multiple 4 TB disks?
18:35 🔗 Nemo_bis To increase concurrency and handle resumes
18:35 🔗 JAA Or did you not upload continuously?
18:35 🔗 Nemo_bis Otherwise it takes ages
18:35 🔗 JAA Hmm
18:36 🔗 Nemo_bis I usually uploaded batches of at least 6-12 months, at least 6 months after the latest
18:36 🔗 JAA Right
18:36 🔗 JAA Yeah, then it takes a large amount of data obviously.
18:36 🔗 JAA storage*
18:37 🔗 Nemo_bis With some server closer to both IA and a WMF upload cache, it might be easier to saturate a gigabit connection without concurrency
18:37 🔗 Nemo_bis Then the bottleneck would be the I/O speed (how quickly you can write and zip)
18:38 🔗 Nemo_bis From Europe the bottleneck is invariably networking
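[Editor's note: a rough back-of-the-envelope calculation for the figures discussed above. It assumes 200 GB per day (one reading of "a couple hundred GB"), a 182-day batch, and an ideal 1 Gbit/s link with no protocol overhead. All three numbers are illustrative assumptions, not measurements from the log.]

```python
def transfer_hours(size_gb: float, link_gbit_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes over a link of link_gbit_per_s gigabits per second."""
    seconds = size_gb * 8 / link_gbit_per_s  # 8 bits per byte
    return seconds / 3600


DAILY_GB = 200    # assumed daily size
BATCH_DAYS = 182  # assumed six-month batch

batch_gb = DAILY_GB * BATCH_DAYS
print(f"batch size: {batch_gb / 1000:.1f} TB")
print(f"one day at 1 Gbit/s: {transfer_hours(DAILY_GB, 1.0):.2f} h")
print(f"six months at 1 Gbit/s: {transfer_hours(batch_gb, 1.0):.1f} h")
```

Under these assumptions a six-month batch is several times larger than a single 4 TB disk, and even a fully saturated gigabit link needs days rather than hours, which is consistent with networking being the limiting factor.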
18:38 🔗 JAA Without concurrency, you can just write to ZIP directly I guess.
18:39 🔗 JAA (Or if ZIP doesn't support that, .tar.gz, which would be a minimal format change.)
18:42 🔗 JAA That would at least get rid of half the I/O.
18:42 🔗 Nemo_bis .tar.gz is not recommended, it would be unusable at those sizes without seeking
18:42 🔗 Nemo_bis you can write directly to ZIP, yes, but then you need to handle network failures
18:43 🔗 Nemo_bis if you download everything and then ZIP, you can just let wget handle the retries and continuation
18:43 🔗 JAA Well yeah, accessing it would be a pain, that's true.
18:43 🔗 JAA Right
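[Editor's note: a minimal sketch of the "write directly to ZIP, but handle network failures yourself" idea discussed above. It is not the code from commonsdownloader.py. The helper names `add_with_retries` and `make_flaky_fetch` are hypothetical, the download is simulated by a callable instead of a real HTTP request, and each file is assumed to fit in memory.]

```python
import io
import time
import zipfile
from typing import Callable


def add_with_retries(zf: zipfile.ZipFile, name: str,
                     fetch: Callable[[], bytes],
                     attempts: int = 3, delay: float = 0.0) -> int:
    """Fetch one file completely, retrying on OSError, then add it to the archive.

    The archive is only written after a fetch succeeds, so a failed
    download never leaves a partial member in the ZIP.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        try:
            data = fetch()
        except OSError:
            if attempt == attempts:
                raise  # out of retries: let the caller decide what to do
            time.sleep(delay)
        else:
            zf.writestr(name, data)
            return attempt
    raise ValueError("attempts must be at least 1")


def make_flaky_fetch(payload: bytes, failures: int) -> Callable[[], bytes]:
    """Hypothetical stand-in for a download that fails a few times first."""
    state = {"left": failures}

    def fetch() -> bytes:
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("simulated network failure")
        return payload

    return fetch


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    tries = add_with_retries(zf, "Example.jpg", make_flaky_fetch(b"image bytes", 2))

# ZIP keeps a central directory, so one member can be read without
# decompressing everything before it (unlike a .tar.gz stream).
with zipfile.ZipFile(buffer) as zf:
    restored = zf.read("Example.jpg")

print(tries, restored)
```

The trade-off matches the one described in the log: downloading to disk first lets wget handle retries and continuation, while writing straight into the archive saves one round of disk I/O but moves the retry logic into your own code.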
18:44 🔗 JAA Is the code you used for this back then available somewhere?
18:48 🔗 Nemo_bis yes, it's all in our usual repo
18:48 🔗 Nemo_bis very basic https://github.com/WikiTeam/wikiteam/blob/master/wikimediacommons/commonsdownloader.py
18:49 🔗 Nemo_bis it's easy to get access to the Wikimedia DB replica, but if you want I can run https://github.com/WikiTeam/wikiteam/blob/master/wikimediacommons/commonssql.py for you
18:54 🔗 JAA Ah, of course it is. :-)
18:59 🔗 JAA That's this I assume? https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database
19:18 🔗 Wingy has quit IRC (The Lounge - https://thelounge.chat)
19:43 🔗 Nemo_bis JAA: yes but you don't need to know any of that, it's just "sql commonswiki" once you login https://wikitech.wikimedia.org/wiki/Help:MySQL_queries#Accessing_the_databases
19:45 🔗 JAA Ah, sweet.
19:48 🔗 JAA Sounds good. I don't have time to do anything about this anytime soon, but I guess that's all the information needed to resume that archival in principle.
21:28 🔗 Zerote has joined #wikiteam
21:41 🔗 benjins has joined #wikiteam
21:50 🔗 Nemo_bis JAA: yes, so far most of the people involved were wikimedians so some information is either taken for granted or documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Dumps
21:53 🔗 Nemo_bis nothing is especially complicated, just tedious
23:20 🔗 JAA Nemo_bis: Right. I guess we should document it on the wikiteam page on AT wiki.
