[00:12] *** paul2520 has joined #wikiteam
[03:41] <nico_32> JAA: http://109.190.103.130/wiki_dump/wikidystifycom-20200827-history.xml.7z & http://109.190.103.130/wiki_dump/wikidystifycom-20200827-wikidump.7z
[06:29] <nico_32> https://github.com/WikiTeam/wikiteam/pull/394
[06:30] <nico_32> fixed the scraper
[06:50] <nico_32> Nemo_bis: https://travis-ci.org/github/WikiTeam/wikiteam/jobs/721928138 
[06:50] <nico_32> :)
[07:23] <Nemo_bis> Oh my, Special:Allpages
[07:24] <Nemo_bis> nico_32: remember to use that judiciously, it's kinda expensive on larger wikis
[07:24] <nico_32> i try to only use the api
[07:26] <Nemo_bis> nico_32: I imagine, just saying to be sure
[07:26] <Nemo_bis> Someone almost brought down the English Wikipedia once with Special:Allpages scraping :o
[07:28] <nico_32> i see how it can happens
[07:28] <nico_32> i only worked on the scraper because the CI was failing
[07:29] <Nemo_bis> yes, thank you very much for that!
[07:29] <Nemo_bis> we should make sure our tests don't bring down some wikis accidentally, but I don't know how
[07:30] <nico_32> the test is against archiveteam' wiki
[07:31] <nico_32> we could start a wiki in the CI and test against localhost
[07:33] <JAA> Thanks nico_32! :-)
[07:34] <JAA> I just threw them into AB for now, not sure how we can get these into the item on IA or if we even want that.
[07:34] <JAA> The AT wiki is already kind of at its limits, so yeah, please not.
[07:35] <nico_32> this dump contains image & desc
[07:36] <nico_32> i believe that the admin of a collection can edit item 
[07:50] <Nemo_bis> Better not mix HTML and XML dumps
[07:51] <nico_32> FR
[07:51] <nico_32> 4
[07:52] <nico_32> https://archive.org/details/wiki-wikidystifycom was done with dumpgenerator.py
[07:52] <nico_32> but it is missing all image
[09:37] *** Ryz has quit IRC (Quit: Ping timeout (120 seconds))
[13:24] *** fishbone has joined #wikiteam
[13:25] *** Jon- is now known as Jon
[13:26] <fishbone> hi guys. i need help with dumpgenerator.py
[13:28] <fishbone> for the past two days i was using it to capture a public wikipedia project, making a break for 1+ hour every time i was starting to get 503'd by the server and using "--resume" to continue
[13:30] <fishbone> it was working fine, but the last time i had a break it was in the middle of dumping "Category:*" pages. when i started the process again, it decided against continuing to dump "Category:*" titles, and went to download images instead
[13:31] <fishbone> it did not download any images previously (i was always using the "--images" argument), and now it has downloaded all of them and congratulated me on finishing the dump
[13:31] <fishbone> but as far as i can say, the "Category:*" pages remained incomplete, and whatever else was scheduled for dumping after them is probably missing too
[13:32] <fishbone> can you advice me on how do i check if the dump is complete, and in case it's not, how do i get everything that's missing?
[14:15] <Nemo_bis> fishbone: why would you dump a Wikipedia? 
[14:15] <Nemo_bis> We encourage using the official dumps when available https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
[14:16] <Nemo_bis> fishbone: if you mean a non-Wikimedia MediaWiki wiki (yes, I know it's confusing: https://www.mediawiki.org/wiki/Wikipmediawiki ), thanks for trying. Please share the URL, as some issues can be webserver-specific.
[14:31] <fishbone> Nemo_bis: i was in a hurry and as far as i understood at the time, the dumps available were either a week or a month old, and i wanted to capture the latest version online
[14:32] <JAA> fishbone: Is this about the Scots Wikipedia?
[14:32] <fishbone> yup
[14:33] <fishbone> do you have a flood of new users asking the same questions because of it?
[14:33] <JAA> Yeah, last dump was on the 20th, and the last dump of page edit history was on the 1st.
[14:33] <JAA> No, but I figured that's the only Wikipedia project that would be interesting to grab at the moment.
[14:34] <fishbone> with how things are i wouldn't be surprised the scots wiki gets the full wipe due to "machine learning models considerations" or something like that
[14:34] <fishbone> so i really wanted to have a copy 
[14:36] <JAA> If so, I'd assume they'd run a final dump before doing so. But I don't know anything about WMF-internal politics, so not sure.
[14:36] <fishbone> better safe than sorry
[14:39] <JAA> I remember something about a way to get direct access to a read-only copy of the WMF databases, which would be much more efficient. Nemo_bis would probably know more.
[14:40] <fishbone> i would welcome any kind of advice, thanks
[14:41] <fishbone> although i think i was pretty close to finishing, so if there's an easy way to complete the process it'd be great too
[14:46] <phuzion> Has there even been any discussion about scots wikipedia on meta or anything like that? I haven't been following the case too closely.
[14:47] <phuzion> Turns out there is quite the RFC. https://meta.wikimedia.org/wiki/Requests_for_comment/Large_scale_language_inaccuracies_on_the_Scots_Wikipedia
[14:48] <fishbone> i've been periodically saving this one as well :)
[14:48] <fishbone> along with the reddit thread
[14:49] <fishbone> https://pastebin.com/qkBYHs6z here's the final portion of the log with most of the "Downloaded N images" strings omitted if it helps
[14:50] <fishbone> it gave me another 503 in the middle of downloading images so the final number is different from the total as i had to restart again
[14:58] <Nemo_bis> fishbone: sorry but there is no point archiving sco.wikipedia.org: 1) there is no significant activity in terms of editing or deleting https://sco.wikipedia.org/wiki/Special:Log/delete , 2) you may not finish earlier than the next dump, 3) local files are irrelevant, already stored elsewhere, 4) nothing is happening to this wiki, but even if it did we always archive dumps when a wiki is closed or 
[14:58] <Nemo_bis> deleted
[14:59] <Nemo_bis> Archiving the RfC is even more useless because Meta-Wiki is extremely lenient and the chances of anything getting deleted over there are non-existing (unless people are engaging in blatant libel or something, but would you want to distribute that?)
[15:01] <Nemo_bis> It wouldn't be bad to make an item on archive.org with historical official dumps of scowiki though
[15:02] <Nemo_bis> The media tarballs were not updated after 2012, of course https://archive.org/search.php?query=scowiki&and[]=collection%3A%22wikimedia-mediatar%22
[15:05] <Nemo_bis> But for that I'd just recommend to archive https://download.kiwix.org/zim/wikipedia/wikipedia_sco_all_maxi_2020-08.zim
[15:06] <Nemo_bis> On the other hand, if someone has plenty of bandwidth and disk space at hand, I would gladly accept help to update https://archive.org/details/wikimediacommons
[15:25] <fishbone> Nemo_bis: thank you for the thorough reply, i'll have to look into it i guess - i have a feeling that things online started to get the axe way more frequently these days, sometimes with no prior notice where one would be expected
[15:26] <fishbone> youtube quietly allowing certain channels to replace the videos while keeping the url, views, comments etc. comes to mind, but there's plenty of examples
[15:27] <fishbone> i like the story of scots wiki, so any blatant libel and extra drama is part of the story
[15:29] *** systwi_ has joined #wikiteam
[15:30] <fishbone> i feel like i'm gonna fail to do anything actually useful, but if you have a link that explains what one needs to do to update https://archive.org/details/wikimediacommons i'm going to read it to see if i can help
[15:34] *** systwi has quit IRC (Read error: Operation timed out)
[15:55] <Nemo_bis> fishbone: Wikimedia and Google are pretty much the opposite, Wikimedia Foundation is nearly all free software and open data
[15:55] <Nemo_bis> fishbone: I'm afraid we only have the code as explanation https://github.com/WikiTeam/wikiteam/tree/master/wikimediacommons https://wikitech.wikimedia.org/wiki/Nova_Resource:Dumps
[15:56] <Nemo_bis> fishbone: Making an item to collect existing dumps and archives of sco.wikipedia.org is something you can easily do (even manually) and would be useful
[15:57] <Nemo_bis> It will most likely be less than 5 GB even if you include several years
[15:57] <Nemo_bis> Then you can advertise the torrent on Reddit for your 15 min of glory :)
[15:57] *** Ryz has joined #wikiteam
[16:35] <fishbone> honestly, having to post on reddit sounds like something i would use to discourage people :)) i'll check out your links, but please don't take it as a commitment 
[16:35] <fishbone> i move around a lot and also i'm, uuh, averse to work
[16:35] <fishbone> have a nice friday evening everybody, i'm off for the day
[16:35] <fishbone> thanks again
[16:36] *** fishbone has quit IRC (Quit: Leaving...)
[16:39] <Nemo_bis> Unbelievable, someone whose work cannot be bought with internet points! ;-)
[17:44] <JAA> Nemo_bis: The official dump files are getting uploaded continuously to IA already as far as I can see. https://archive.org/search.php?query=scowiki&sort=-publicdate
[17:47] <JAA> Or did you mean dumps from 2005 to 2011?
[17:48] <JAA> Nope, there are also plenty of those (was sorting by the wrong date at first): https://archive.org/search.php?query=scowiki&sort=date
[17:49] <JAA> Wikimedia Commons weren't updated since September 2016, am I seeing that right? I was under the impression that was still actively mirrored. Oof.
[18:02] <Nemo_bis> JAA: Where "actively" means "whenever Nemo_bis gives up on waiting for someone else to step up"
[18:02] <JAA> Ah, I thought someone had built some automated system that needs near-zero maintenance. Although this is AT, so obviously not. :-P
[18:03] <Nemo_bis> Yes, there are many incremental dumps from across the years but these are all rather recent and most people won't have much use for them. So it can be interesting to have the most recent full dumps and something older from various places
[18:03] <Nemo_bis> There's nothing automated because we cannot afford keeping big enough hard disks online all the time
[18:04] <JAA> Looks like a couple hundred GiB for each daily dump. Is that new uploads per day?!
[18:04] <Nemo_bis> Yes. Some can be much bigger though
[18:04] <JAA> Damn.
[18:05] <Nemo_bis> And yes, ideally we'd find some smarter way to partition them.
[18:05] <JAA> Should be easy to split it into hourly dumps, no?
[18:07] <Nemo_bis> I guess so. Although that's no guarantee either, when someone starts uploading millions of TIFF maps of innumerable megapixels
[18:07] <JAA> I'm getting a vague feeling that we've had this discussion before.
[18:07] <Nemo_bis> Somewhat unhelpfully, some users are now mirroring nearly all Internet Archive books on Wikimedia Commons, so this collection might need to be reassessed in the future
[18:07] <Nemo_bis> Yes probably. :)
[18:08] <JAA> Hah
[18:08] <JAA> EFNet/#wikiteam 2020-01-01 17:46:10 UTC  <@JAA> Nemo_bis: I just noticed that https://archive.org/details/wikimediacommons?sort=-publicdate hasn't been updated since 2016. How come? ...
[18:08] <JAA> And it wasn't even that long ago.