Time | Nickname | Message |
01:03 | | vitzli has joined #wikiteam |
01:04 | | vitzli has quit IRC (Remote host closed the connection) |
01:19 | | Wingy has quit IRC (Read error: Operation timed out) |
01:22 | | chfoo has quit IRC (Read error: Operation timed out) |
01:22 | | luckcolor has quit IRC (se.hub efnet.portlane.se) |
01:25 | | chfoo has joined #wikiteam |
01:28 | | MrRadar2 has quit IRC (Read error: Operation timed out) |
01:29 | | kiska has quit IRC (Read error: Connection reset by peer) |
01:29 | | systwi has quit IRC (Ping timeout: 622 seconds) |
01:30 | | systwi has joined #wikiteam |
01:31 | | balrog has quit IRC (Read error: Operation timed out) |
01:31 | | astrid has quit IRC (Read error: Operation timed out) |
01:31 | | Igloo has quit IRC (Read error: Operation timed out) |
01:32 | | Iglooop1 has quit IRC (Read error: Operation timed out) |
01:32 | | balrog has joined #wikiteam |
01:32 | | Flashfire has quit IRC (Ping timeout: 276 seconds) |
01:32 | | Zerote_ has quit IRC (Ping timeout: 276 seconds) |
01:32 | | Zerote has joined #wikiteam |
01:32 | | MrRadar has quit IRC (Read error: Operation timed out) |
01:32 | | arkiver has quit IRC (Read error: Operation timed out) |
01:33 | | VADemon has joined #wikiteam |
01:33 | | benjins has quit IRC (Read error: Operation timed out) |
01:33 | | chfoo has quit IRC (Read error: Operation timed out) |
01:33 | | balrog has quit IRC (Remote host closed the connection) |
01:34 | | arkiver has joined #wikiteam |
01:34 | | balrog has joined #wikiteam |
01:35 | | svchfoo1 sets mode: +o arkiver |
01:35 | | svchfoo3 sets mode: +o arkiver |
01:36 | | atphoenix has quit IRC (Read error: Operation timed out) |
01:37 | | VADemon_ has quit IRC (Read error: Operation timed out) |
01:37 | | systwi_ has joined #wikiteam |
01:38 | | systwi_ has quit IRC (Read error: Connection reset by peer) |
01:41 | | systwi has quit IRC (Ping timeout: 622 seconds) |
01:42 | | Zerote has quit IRC (Ping timeout: 622 seconds) |
01:44 | | systwi has joined #wikiteam |
01:48 | | systwi_ has joined #wikiteam |
01:54 | | systwi has quit IRC (Ping timeout: 622 seconds) |
01:59 | | systwi_ has quit IRC (Ping timeout: 622 seconds) |
02:10 | | systwi_ has joined #wikiteam |
02:12 | | systwi_ has quit IRC (Read error: Connection reset by peer) |
02:13 | | systwi_ has joined #wikiteam |
02:14 | | MrRadar2 has joined #wikiteam |
02:14 | | Igloo has joined #wikiteam |
02:15 | | chfoo has joined #wikiteam |
02:15 | | systwi_ has quit IRC (Read error: Connection reset by peer) |
02:15 | | MrRadar has joined #wikiteam |
02:16 | | systwi_ has joined #wikiteam |
02:16 | | Iglooop1 has joined #wikiteam |
02:17 | | svchfoo1 sets mode: +o Iglooop1 |
02:17 | | svchfoo3 sets mode: +o Iglooop1 |
02:18 | | MrRadar has quit IRC (Write error: Broken pipe) |
02:18 | | Igloo has quit IRC (Read error: Operation timed out) |
02:18 | | Iglooop1 has quit IRC (Read error: Operation timed out) |
02:19 | | chfoo has quit IRC (Read error: Operation timed out) |
02:24 | | yano_ is now known as yano |
02:26 | | chfoo has joined #wikiteam |
02:38 | | Iglooop1 has joined #wikiteam |
02:39 | | svchfoo1 sets mode: +o Iglooop1 |
02:39 | | svchfoo3 sets mode: +o Iglooop1 |
02:39 | | astrid has joined #wikiteam |
02:40 | | Iglooop1 sets mode: +o astrid |
02:41 | | systwi_ has quit IRC (Ping timeout: 622 seconds) |
02:41 | | Igloo has joined #wikiteam |
02:42 | | chfoo has quit IRC (Ping timeout: 622 seconds) |
02:42 | | MrRadar has joined #wikiteam |
03:04 | | systwi has joined #wikiteam |
03:08 | | systwi_ has joined #wikiteam |
03:14 | | systwi has quit IRC (Ping timeout: 622 seconds) |
03:17 | | systwi has joined #wikiteam |
03:19 | | systwi_ has quit IRC (Ping timeout: 622 seconds) |
03:24 | | systwi_ has joined #wikiteam |
03:29 | | systwi has quit IRC (Ping timeout: 622 seconds) |
03:35 | | systwi__ has joined #wikiteam |
03:36 | | systwi_ has quit IRC (Ping timeout: 622 seconds) |
05:39 | | kiska has joined #wikiteam |
05:39 | | Iglooop1 sets mode: +o kiska |
06:19 | | chfoo has joined #wikiteam |
06:25 | | systwi__ is now known as systwi |
12:36 | | astrid has quit IRC (Read error: Operation timed out) |
12:44 | | astrid has joined #wikiteam |
12:44 | | Iglooop1 sets mode: +o astrid |
16:43 | | kiska18 has quit IRC (Remote host closed the connection) |
16:44 | | kiska18 has joined #wikiteam |
16:44 | | Iglooop1 sets mode: +o kiska18 |
17:25 | | VADemon has quit IRC (Quit: left4dead) |
17:46 | JAA | Nemo_bis: I just noticed that https://archive.org/details/wikimediacommons?sort=-publicdate hasn't been updated since 2016. How come? Is there another dataset of the Commons data that I missed? |
18:24 | Nemo_bis | JAA: no |
18:27 | Nemo_bis | I just got sick of doing it without some proper hardware; the last time I did it with some 4 TB disks over a Gigabit connection at my university office |
18:28 | | Wingy has joined #wikiteam |
18:29 | JAA | Ah |
18:31 | JAA | Sounds like better tooling is needed to avoid keeping a full copy on local storage? |
18:32 | Nemo_bis | not really unless we want to change format |
18:32 | Nemo_bis | there is some merit in the simplicity of the daily ZIP files |
18:34 | Nemo_bis | mostly, one could save a lot of time with better error handling |
18:34 | JAA | Hmm, but the daily sizes are only a couple hundred GB, so why did you need multiple 4 TB disks? |
18:35 | Nemo_bis | To increase concurrency and handle resumes |
18:35 | JAA | Or did you not upload continuously? |
18:35 | Nemo_bis | Otherwise it takes ages |
18:35 | JAA | Hmm |
18:36 | Nemo_bis | I usually uploaded batches of at least 6-12 months at least 6 months after the latest |
18:36 | JAA | Right |
18:36 | JAA | Yeah, then it takes a large amount of data obviously. |
18:36 | JAA | storage* |
18:37 | Nemo_bis | With some server closer to both IA and a WMF upload cache, it might be easier to saturate a gigabit connection without concurrency |
18:37 | Nemo_bis | Then the bottleneck would be the I/O speed (how quickly you can write and zip) |
18:38 | Nemo_bis | From Europe the bottleneck is invariably networking |
18:38 | JAA | Without concurrency, you can just write to ZIP directly I guess. |
18:39 | JAA | (Or if ZIP doesn't support that, .tar.gz, which would be a minimal format change.) |
18:42 | JAA | That would at least get rid of half the I/O. |
18:42 | Nemo_bis | .tar.gz is not recommended, it would be unusable at those sizes without seeking |
18:42 | Nemo_bis | you can write directly to ZIP, yes, but then you need to handle network failures |
18:43 | Nemo_bis | if you download everything and then ZIP, you can just let wget handle the retries and continuation |
18:43 | JAA | Well yeah, accessing it would be a pain, that's true. |
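The seeking concern raised above can be illustrated with a minimal sketch using only Python's standard library. The three file names and their contents are made up for the example and are not part of the Commons archive. It shows that a ZIP member can be read by name through the archive's central directory, while a `.tar.gz` opened as a non-seekable stream has to be walked member by member until the wanted file is reached.

```python
import io
import tarfile
import zipfile

# Hypothetical stand-ins for the files of one daily archive.
files = {"a.jpg": b"first", "b.jpg": b"second", "c.jpg": b"third"}

# ZIP: every member is compressed separately and listed in a central directory.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

with zipfile.ZipFile(zip_buf) as zf:
    # The central directory records where "c.jpg" starts, so only it is read.
    zip_result = zf.read("c.jpg")

# .tar.gz: a single gzip stream wraps the whole tar archive.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

tar_buf.seek(0)
members_scanned = 0
tar_result = None
# Stream mode "r|gz" does not allow seeking, so members are visited in order.
with tarfile.open(fileobj=tar_buf, mode="r|gz") as tf:
    for member in tf:
        members_scanned += 1
        if member.name == "c.jpg":
            tar_result = tf.extractfile(member).read()
            break

print(zip_result, tar_result, members_scanned)
```

With three tiny files the difference is invisible, but the same sequential scan over a multi-hundred-GB `.tar.gz` means decompressing everything that precedes the wanted file.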
18:43 | JAA | Right |
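The trade-off discussed above (writing straight into the ZIP means handling network failures yourself instead of leaving retries to wget) could look roughly like the sketch below. This is not the code from commonsdownloader.py. The names `fetch_with_retries`, `stream_to_zip` and `flaky_fetch` are hypothetical, and the download is simulated by a function that fails on purpose so the example runs without network access.

```python
import io
import time
import zipfile


def fetch_with_retries(fetch, name, attempts=3, delay=0.0):
    """Call fetch(name), retrying on OSError up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(name)
        except OSError as exc:  # ConnectionError and TimeoutError are OSErrors
            last_error = exc
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise last_error


def stream_to_zip(names, fetch, out):
    """Write each fetched file straight into the ZIP; return names that failed."""
    failed = []
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            try:
                data = fetch_with_retries(fetch, name)
            except OSError:
                failed.append(name)  # keep going, record it for a later run
                continue
            zf.writestr(name, data)  # only complete downloads reach the archive
    return failed


# Simulated download: the first attempt always times out, one file never works.
calls = {}


def flaky_fetch(name):
    calls[name] = calls.get(name, 0) + 1
    if name == "gone.jpg":
        raise ConnectionError("permanent failure")
    if calls[name] == 1:
        raise TimeoutError("transient failure")
    return name.encode()


buf = io.BytesIO()
failed = stream_to_zip(["a.jpg", "gone.jpg", "b.jpg"], flaky_fetch, buf)
with zipfile.ZipFile(buf) as zf:
    archived = zf.namelist()
print(archived, failed)
```

The sketch holds each file in memory before writing it, which avoids the intermediate copy on disk but is only reasonable when individual files are small enough. Resuming a partially downloaded file, which wget provides for free, is not covered here.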
18:44 | JAA | Is the code you used for this back then available somewhere? |
18:48 | Nemo_bis | yes, it's all in our usual repo |
18:48 | Nemo_bis | very basic https://github.com/WikiTeam/wikiteam/blob/master/wikimediacommons/commonsdownloader.py |
18:49 | Nemo_bis | it's easy to get access to the Wikimedia DB replica, but if you want I can run https://github.com/WikiTeam/wikiteam/blob/master/wikimediacommons/commonssql.py for you |
18:54 | JAA | Ah, of course it is. :-) |
18:59 | JAA | That's this I assume? https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database |
19:18 | | Wingy has quit IRC (The Lounge - https://thelounge.chat) |
19:43 | Nemo_bis | JAA: yes but you don't need to know any of that, it's just "sql commonswiki" once you log in https://wikitech.wikimedia.org/wiki/Help:MySQL_queries#Accessing_the_databases |
19:45 | JAA | Ah, sweet. |
19:48 | JAA | Sounds good. I don't have time to do anything about this anytime soon, but I guess that's all the information needed to in principle resume that archival. |
21:28 | | Zerote has joined #wikiteam |
21:41 | | benjins has joined #wikiteam |
21:50 | Nemo_bis | JAA: yes, so far most of the people involved were wikimedians so some information is either taken for granted or documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Dumps |
21:53 | Nemo_bis | nothing is especially complicated, just tedious |
23:20 | JAA | Nemo_bis: Right. I guess we should document it on the wikiteam page on AT wiki. |