[00:08] *** randomdes has quit IRC (Ping timeout: 268 seconds)
[00:10] *** randomdes has joined #wikiteam
[01:14] *** ta9le has quit IRC (Quit: Connection closed for inactivity)
[07:23] *** midas1 has quit IRC (Read error: Operation timed out)
[07:23] *** svchfoo3 has quit IRC (Read error: Operation timed out)
[07:24] *** midas1 has joined #wikiteam
[07:25] *** svchfoo3 has joined #wikiteam
[07:25] *** svchfoo1 sets mode: +o svchfoo3
[11:24] *** svchfoo3 has quit IRC (Read error: Operation timed out)
[11:26] *** ta9le has joined #wikiteam
[11:27] *** svchfoo3 has joined #wikiteam
[11:28] *** svchfoo1 sets mode: +o svchfoo3
[12:50] *** netchup has joined #wikiteam
[12:50] *** netchup has quit IRC (Client Quit)
[12:53] *** netchup has joined #wikiteam
[12:53] <netchup> hi
[12:54] *** netchupp has joined #wikiteam
[12:54] *** netchupp has quit IRC (Client Quit)
[12:55] <netchup> i am new here, dumps generated by dumpgenerator.py are visible via WB Machine?
[12:58] <netchup> JAA: can we talk? uzerus here :)
[12:59] <JAA> Oh, hey netchup, long time no see. Your ArchiveBot job for those school websites is *still* running (almost seven months now).
[12:59] <JAA> I have no idea regarding this project, I'm just idling here.
[13:00] <netchup> ah, ok :) can you just give me the id of ArchiveBot job?
[13:01] <JAA> netchup: That's a5l5nek576o746i75mvi27j31. It's pretty close to finishing though, should be done within August.
[13:04] <netchup> Nice to see that :) now i know i did good job, few k domains are saved :) but i cannot find all of them, did what i can
[13:07] <Nemo_bis> netchup: no, dumps need MediaWiki to be parsed into HTML
[13:09] <JAA> Nemo_bis: Was WikiTeam ever archiving in WARC format?
[13:10] <Nemo_bis> No, never
[13:10] <Nemo_bis> Although at some point someone in ArchiveTeam launched some warrior project for wikis (I have no idea what was in it or what came out of it)
[13:11] <JAA> Huh
[13:11] <JAA> I asked mainly because the AT wiki prominently says "WikiTeam (WARC format)" on the homepage.
[13:11] <Nemo_bis> Yeah, it could be that whoever put that job up never wrapped it up
[13:12] <Nemo_bis> It's been in the warrior for a long time (is it still?)
[13:12] <Nemo_bis> To download all the possible HTML representations that MediaWiki offers for the data in our dumps it would probably take billions of HTTP requests
[13:14] <JAA> I see.
[13:15] <JAA> Yeah, it would definitely be big.
[13:15] <JAA> And it would require careful ignores, e.g. Special:Log, Special:RecentChanges, Special:NewImages (I think that one's an extension), etc.
[13:16] <Nemo_bis> I'd say 5.2 billions at a minimum https://wikistats.wmflabs.org/
[13:16] <Nemo_bis> Considering one per page (+ most recent history) and one per revision (+diff)
[13:17] <JAA> Yeah, though I'd say that the edit page is more useful than the diff one.
[13:18] <Nemo_bis> But the revisions are not usable without the diff
[13:18] <JAA> How so?
[13:18] <Nemo_bis> By looking at the final HTML you have no idea what the edit changed
[13:18] <Nemo_bis> Unless you run some diffing locally
[13:19] <JAA> Yeah, that's true, you'd have to run a diff afterwards.
[13:19] <JAA> We ignored diffs in ArchiveBot anyway because it doubles (approximately) the number of URLs that need to be retrieved.
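
The "careful ignores" and skipped diffs discussed here translate, in practice, into URL patterns excluded from a MediaWiki crawl. A minimal sketch in Python; the regexes are illustrative only, not ArchiveBot's actual ignore set:

import re

# Illustrative ignore patterns for crawling a MediaWiki site into WARC.
# These are a sketch, not ArchiveBot's actual ruleset.
IGNORE_PATTERNS = [
    r"[?&]title=Special:(Log|RecentChanges|NewImages)\b",  # dynamic special pages
    r"/Special:(Log|RecentChanges|NewImages)\b",           # same pages via short URLs
    r"[?&]diff=\d+",                                       # diff views roughly double the URL count
]

def should_ignore(url):
    """Return True if the URL matches any ignore pattern."""
    return any(re.search(pattern, url) for pattern in IGNORE_PATTERNS)

# Example: should_ignore("https://example.org/index.php?title=Special:RecentChanges") -> True
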
[13:19] <Nemo_bis> Wikia alone would be 1G pages multiplied by some 2 MB (their HTML is incredibly bloated) uncompressed
[13:21] <Nemo_bis> While the XML is "only" 850 GiB https://archive.org/details/wikia_dump_20180602
[13:21] <Nemo_bis> Speaking of which, a good use of a server with some hundreds GiB of disk would be to run again that Wikia archival with the latest WikiTeam code :)
[13:21] <Nemo_bis> Part of the latest archival was done with some bugs re API requests
[13:37] <JAA> arkiver: Do you know more about WikiTeam and WARCs?
[13:47] <JAA> Nemo_bis: That could be done in chunks, right? I.e. it wouldn't be necessary to store all 850 GiB at the same time.
[13:47] <JAA> As far as I can see, it's one dump file per wiki anyway.
[13:48] <JAA> So grab [0-9] wikis, compress, upload, delete, grab "a" wikis, compress, upload, delete, etc.
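
That chunked workflow could be sketched roughly as below. Everything beyond the log is an assumption: a wikia_api_urls.txt with one api.php URL per line, WikiTeam's dumpgenerator.py in the working directory and run under Python 2, dump directories named *-wikidump, and uploads via the `ia` CLI with made-up item identifiers.

import glob
import itertools
import os
import shutil
import subprocess
from urllib.parse import urlparse

def bucket(api_url):
    """Group wikis by the first character of their hostname: digits together, then a, b, c, ..."""
    host = urlparse(api_url).netloc.lower()
    return "0-9" if not host or host[0].isdigit() else host[0]

with open("wikia_api_urls.txt") as f:
    urls = sorted((line.strip() for line in f if line.strip()), key=bucket)

for letter, chunk in itertools.groupby(urls, key=bucket):
    for api in chunk:
        # One XML + images dump per wiki.
        subprocess.run(["python2", "dumpgenerator.py", "--api=" + api, "--xml", "--images"])
    # Compress whatever this chunk produced, upload it, then free the disk.
    archive = "wikia_dumps_%s.7z" % letter
    dumps = glob.glob("*-wikidump")
    subprocess.run(["7z", "a", archive] + dumps, check=True)
    subprocess.run(["ia", "upload", "wikia-dump-chunk-%s" % letter, archive], check=True)
    for path in dumps:
        shutil.rmtree(path)
    os.remove(archive)
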
[14:11] <Nemo_bis> In theory yes, but given a wiki might take a totally random amount of space from 100 KB to 250 GB that's rarely useful.
[14:11] <Nemo_bis> Then I'm open to other people experimenting with their own methods. :)
[14:11] <JAA> An individual wiki, sure, but a large number of wikis should be fairly predictable (comparatively, at least).
[14:12] <Nemo_bis> dunno
[14:12] <Nemo_bis> I have no evidence of size being randomly distributed
[14:13] <Nemo_bis> With some effort it can be enough to have 300 GiB of disk or so
[14:13] <Nemo_bis> Personally I preferred to spend some dozens € more on disk space and save myself hours of pointless work
[14:14] <JAA> Yeah, makes sense.
[14:15] <JAA> How were the Wikia wikis discovered, by the way?
[14:16] <Nemo_bis> There's an API to list them all
[14:16] <JAA> Ah, nice.
[14:16] <Nemo_bis> https://github.com/WikiTeam/wikiteam/blob/master/listsofwikis/mediawiki/wikia.py
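
For illustration, walking such a paginated listing API might look like the sketch below. The endpoint, parameter names, and response fields are assumptions, not necessarily what the linked wikia.py does.

import requests

# Assumed endpoint and parameters; the real Wikia/Fandom API may differ.
API = "https://community.fandom.com/api/v1/Wikis/List"

def list_wikis(limit=250):
    """Yield wiki base URLs by paging through the (assumed) listing endpoint."""
    batch = 1
    while True:
        resp = requests.get(API, params={"limit": limit, "batch": batch}, timeout=60)
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break
        for wiki in items:
            yield wiki.get("url")  # assumed field carrying the wiki's base URL
        batch += 1

if __name__ == "__main__":
    with open("wikia_wikis.txt", "w") as out:
        for url in list_wikis():
            if url:
                out.write(url + "\n")
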
[15:36] <netchup> how items are added to wiki warrior?
[15:43] <arkiver> <Nemo_bis> Although at some point someone in ArchiveTeam launched some warrior project for wikis (I have no idea what was in it or what came out of it)
[15:43] <arkiver> that was me
[15:43] <arkiver> we archived external links
[15:44] <arkiver> https://archive.org/details/archiveteam_wiki
[15:45] <arkiver> not much happened with the project, but I'm totally for getting it running again
[15:45] <arkiver> with more wikis to archive external URLs from and maybe archive all wikis themselves?
[15:46] <arkiver> If diffs only double the number of URLs, then we could get them anyway, since we don't really have a deadline for this
[15:46] <netchup> arkiver: can i help you someway?
[15:46] <arkiver> not sure
[15:47] <arkiver> Nemo_bis: JAA: are you fine with this being the 'official' channel of the warrior project too?
[15:47] <arkiver> I'll try to get some stuff restarted today, code should be pretty much there already for wikimedias
[15:48] <arkiver> err
[15:48] <arkiver> mediawikis
[15:48] <JAA> Ah, only external links, I see. I'll add that to our wiki somewhere, since it's currently not clear at all what the "WARC format" is referring to.
[15:48] <netchup> fine, cos urlteam warrior has stopped, for now only newsgrabber is working
[15:49] <arkiver> JAA: I'd like to start archiving wikis themselves too
[15:49] * arkiver is afk for ~30 min
[15:49] <netchup> what about dumps? can we use them for archiving wikis?
[15:50] <netchup> i mean the manual job that has been done
[15:51] <netchup> some of wikis probably cannot be accessible in WB, and are dead. Dumps we grabbed can help with that i think
[15:51] <netchup> (i am not a programmer, so if you can please clear that)
[15:52] <JAA> netchup: We won't fabricate data for the Wayback Machine. But someone could take the dumps, set them up again somewhere as a wiki, and then we could archive those. That would be distinct from the original wiki though.
[15:58] <netchup> for archiving wikis, you probably need a list of them
[15:59] <netchup> this is where i can help a little
[16:02] <Nemo_bis> arkiver: sure, there's no problem in using this channel :)
[16:03] <Nemo_bis> I'm not sure how the WARC collections pageviews are calculated, but 21M is not bad https://archive.org/details/archiveteam_wiki&tab=about
[16:04] <Nemo_bis> I wonder why they became vanishingly small in 2017 compared to 2016. Wayback machine superseding them with its own archives?
[16:04] <Nemo_bis> Or just bots. 99,99 % of views come from California (?)
[16:06] <astrid> strange
[16:34] <arkiver> I'm not totally sure about this, but all the data you get through the Wayback Machine first undergoes some processing to for example rewrite URLs
[16:34] <arkiver> That means someone using the Wayback Machine will not directly download something from the WARCs, but indirectly through IA servers.
[16:34] <astrid> right
[16:35] <astrid> it's processed to insert the javascript header thingy, among other stuff
[16:35] <arkiver> Yes
[16:35] <arkiver> So that would be where the 99.99% California downloads comes from
[16:36] <astrid> i'm seeing lots of california downloads also for non-warc collections
[16:36] <astrid> shrug emoji
[16:37] <arkiver> As for the drop in views, URLs closest to a requested timestamp are sent to a user. IA has never stopped archiving, while we have not been working on these URLs anymore
[16:37] <arkiver> astrid: Possibly search related, I'll ask at IA.
[16:38] <astrid> cool :)
[16:38] <astrid> i'm curious but not enough to bother someone
[20:03] *** balrog_ has joined #wikiteam
[20:09] *** balrog has quit IRC (Ping timeout: 960 seconds)
[20:09] *** balrog_ is now known as balrog
[22:05] *** netchup has quit IRC (Quit: http://www.mibbit.com ajax IRC Client)
[22:11] <JAA> https://wikiapiary.com/wiki/Websites "Total pages -8,982,570,193,088,097,861"
[22:11] <JAA> Nice
[22:11] <JAA> The total number of edits also seems slightly unrealistic: 552,816,861,540,660,473
[22:20] <JAA> arkiver: Can you change the description of https://archive.org/details/archiveteam_wiki ? Since that's the archive of external links, not the wikis itself...
[22:20] <JAA> I'm updating our wiki page on WikiTeam right now.
[23:23] *** ta9le has quit IRC (Quit: Connection closed for inactivity)