#wikiteam 2020-08-28,Fri


Time Nickname Message
00:12 🔗 paul2520 has joined #wikiteam
03:41 🔗 nico_32 JAA: http://109.190.103.130/wiki_dump/wikidystifycom-20200827-history.xml.7z & http://109.190.103.130/wiki_dump/wikidystifycom-20200827-wikidump.7z
06:29 🔗 nico_32 https://github.com/WikiTeam/wikiteam/pull/394
06:30 🔗 nico_32 fixed the scraper
06:50 🔗 nico_32 Nemo_bis: https://travis-ci.org/github/WikiTeam/wikiteam/jobs/721928138
06:50 🔗 nico_32 :)
07:23 🔗 Nemo_bis Oh my, Special:Allpages
07:24 🔗 Nemo_bis nico_32: remember to use that judiciously, it's kinda expensive on larger wikis
07:24 🔗 nico_32 i try to only use the api
07:26 🔗 Nemo_bis nico_32: I imagine, just saying to be sure
07:26 🔗 Nemo_bis Someone almost brought down the English Wikipedia once with Special:Allpages scraping :o
07:28 🔗 nico_32 i see how it can happen
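[Editor's note] For readers unfamiliar with the API route nico_32 mentions: a minimal sketch of listing page titles through the MediaWiki `list=allpages` query module, following the `continue` block the API returns, instead of scraping Special:Allpages. This is not the code from the pull request. The function name `iter_all_titles`, the `fetch` callable (anything that takes the query parameters and returns the decoded JSON response) and the one-second pause between batches are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Iterator


def iter_all_titles(
    fetch: Callable[[Dict[str, str]], dict],
    namespace: int = 0,
    limit: int = 500,
    delay: float = 1.0,
) -> Iterator[str]:
    """Yield page titles from a MediaWiki API, one batch at a time.

    `fetch` is a hypothetical helper: it receives the query parameters
    and returns the parsed JSON response of api.php as a dict.
    """
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": str(namespace),
        "aplimit": str(limit),
        "format": "json",
    }
    cont: Dict[str, str] = {}
    while True:
        data = fetch({**params, **cont})
        for page in data.get("query", {}).get("allpages", []):
            yield page["title"]
        # The API returns a "continue" block while more results remain.
        cont = data.get("continue") or {}
        if not cont:
            break
        # Pause between batches to keep the load on the wiki low.
        time.sleep(delay)
```

Passing the request function in keeps the sketch independent of any HTTP library, and makes it possible to try the continuation logic against canned responses before pointing it at a live wiki.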
07:28 🔗 nico_32 i only worked on the scraper because the CI was failing
07:29 🔗 Nemo_bis yes, thank you very much for that!
07:29 🔗 Nemo_bis we should make sure our tests don't bring down some wikis accidentally, but I don't know how
07:30 🔗 nico_32 the test is against archiveteam's wiki
07:31 🔗 nico_32 we could start a wiki in the CI and test against localhost
07:33 🔗 JAA Thanks nico_32! :-)
07:34 🔗 JAA I just threw them into AB for now, not sure how we can get these into the item on IA or if we even want that.
07:34 🔗 JAA The AT wiki is already kind of at its limits, so yeah, please not.
07:35 🔗 nico_32 this dump contains image & desc
07:36 🔗 nico_32 i believe that the admin of a collection can edit item
07:50 🔗 Nemo_bis Better not mix HTML and XML dumps
07:51 🔗 nico_32 FR
07:51 🔗 nico_32 4
07:52 🔗 nico_32 https://archive.org/details/wiki-wikidystifycom was done with dumpgenerator.py
07:52 🔗 nico_32 but it is missing all images
09:37 🔗 Ryz has quit IRC (Quit: Ping timeout (120 seconds))
13:24 🔗 fishbone has joined #wikiteam
13:25 🔗 Jon- is now known as Jon
13:26 🔗 fishbone hi guys. i need help with dumpgenerator.py
13:28 🔗 fishbone for the past two days i was using it to capture a public wikipedia project, making a break for 1+ hour every time i was starting to get 503'd by the server and using "--resume" to continue
13:30 🔗 fishbone it was working fine, but the last time i had a break it was in the middle of dumping "Category:*" pages. when i started the process again, it decided against continuing to dump "Category:*" titles, and went to download images instead
13:31 🔗 fishbone it did not download any images previously (i was always using the "--images" argument), and now it has downloaded all of them and congratulated me on finishing the dump
13:31 🔗 fishbone but as far as i can say, the "Category:*" pages remained incomplete, and whatever else was scheduled for dumping after them is probably missing too
13:32 🔗 fishbone can you advise me on how to check if the dump is complete, and in case it's not, how to get everything that's missing?
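[Editor's note] One way to approach fishbone's question, as a rough sketch only: compare the list of titles the tool was supposed to fetch against the `<title>` of every `<page>` actually present in the XML dump. The assumptions here are that the dump directory holds a plain-text titles list (one title per line, possibly ending with an `--END--` marker line) and a MediaWiki export XML file. The function names are hypothetical and this is not a feature of dumpgenerator.py itself.

```python
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterable, List, Set


def _local(tag: str) -> str:
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def titles_in_dump(xml_file: BinaryIO) -> Set[str]:
    """Stream through an export XML file and collect every page title."""
    found: Set[str] = set()
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if _local(elem.tag) == "page":
            for child in elem:
                if _local(child.tag) == "title":
                    found.add(child.text or "")
                    break
            # Drop the page's revisions once its title has been recorded.
            elem.clear()
    return found


def missing_titles(expected: Iterable[str], xml_file: BinaryIO) -> List[str]:
    """Return the expected titles that have no <page> in the dump."""
    wanted = {line.strip() for line in expected}
    wanted.discard("")
    wanted.discard("--END--")  # assumed end-of-list marker, not a title
    return sorted(wanted - titles_in_dump(xml_file))
```

A usage example would be `missing_titles(open(titles_path, encoding="utf-8"), open(xml_path, "rb"))`, where both paths are placeholders. An empty result suggests the page dump is complete with respect to the titles list. If the XML file was cut off mid-page, `iterparse` raises `ParseError`, which is itself a sign the dump needs repair.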
14:15 🔗 Nemo_bis fishbone: why would you dump a Wikipedia?
14:15 🔗 Nemo_bis We encourage using the official dumps when available https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
14:16 🔗 Nemo_bis fishbone: if you mean a non-Wikimedia MediaWiki wiki (yes, I know it's confusing: https://www.mediawiki.org/wiki/Wikipmediawiki ), thanks for trying. Please share the URL, as some issues can be webserver-specific.
14:31 🔗 fishbone Nemo_bis: i was in a hurry and as far as i understood at the time, the dumps available were either a week or a month old, and i wanted to capture the latest version online
14:32 🔗 JAA fishbone: Is this about the Scots Wikipedia?
14:32 🔗 fishbone yup
14:33 🔗 fishbone do you have a flood of new users asking the same questions because of it?
14:33 🔗 JAA Yeah, last dump was on the 20th, and the last dump of page edit history was on the 1st.
14:33 🔗 JAA No, but I figured that's the only Wikipedia project that would be interesting to grab at the moment.
14:34 🔗 fishbone with how things are i wouldn't be surprised if the scots wiki gets the full wipe due to "machine learning models considerations" or something like that
14:34 🔗 fishbone so i really wanted to have a copy
14:36 🔗 JAA If so, I'd assume they'd run a final dump before doing so. But I don't know anything about WMF-internal politics, so not sure.
14:36 🔗 fishbone better safe than sorry
14:39 🔗 JAA I remember something about a way to get direct access to a read-only copy of the WMF databases, which would be much more efficient. Nemo_bis would probably know more.
14:40 🔗 fishbone i would welcome any kind of advice, thanks
14:41 🔗 fishbone although i think i was pretty close to finishing, so if there's an easy way to complete the process it'd be great too
14:46 🔗 phuzion Has there even been any discussion about scots wikipedia on meta or anything like that? I haven't been following the case too closely.
14:47 🔗 phuzion Turns out there is quite the RFC. https://meta.wikimedia.org/wiki/Requests_for_comment/Large_scale_language_inaccuracies_on_the_Scots_Wikipedia
14:48 🔗 fishbone i've been periodically saving this one as well :)
14:48 🔗 fishbone along with the reddit thread
14:49 🔗 fishbone https://pastebin.com/qkBYHs6z here's the final portion of the log with most of the "Downloaded N images" strings omitted if it helps
14:50 🔗 fishbone it gave me another 503 in the middle of downloading images so the final number is different from the total as i had to restart again
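[Editor's note] fishbone describes pausing by hand for an hour or more after each 503 and then restarting with `--resume`. A generic way to automate that kind of waiting is retry with exponential backoff, sketched below. This is not how dumpgenerator.py behaves. The name `fetch_with_backoff`, the `request` callable (returning a status code and a body) and the default delays are all assumptions for illustration.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    request: Callable[[], Tuple[int, str]],
    max_tries: int = 5,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `request` until it stops answering 503, doubling the wait.

    `request` is a hypothetical helper returning (status_code, body).
    """
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        status, body = request()
        if status != 503:
            return status, body
        if attempt < max_tries:
            # Server is overloaded: wait, then wait twice as long next time.
            sleep(delay)
            delay *= 2
    raise RuntimeError("still getting 503 after %d tries" % max_tries)
```

Doubling the delay gives an overloaded server progressively more room to recover, which is gentler than retrying at a fixed interval.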
14:58 🔗 Nemo_bis fishbone: sorry but there is no point archiving sco.wikipedia.org: 1) there is no significant activity in terms of editing or deleting https://sco.wikipedia.org/wiki/Special:Log/delete , 2) you may not finish earlier than the next dump, 3) local files are irrelevant, already stored elsewhere, 4) nothing is happening to this wiki, but even if it did we always archive dumps when a wiki is closed or deleted
14:59 🔗 Nemo_bis Archiving the RfC is even more useless because Meta-Wiki is extremely lenient and the chances of anything getting deleted over there are non-existent (unless people are engaging in blatant libel or something, but would you want to distribute that?)
15:01 🔗 Nemo_bis It wouldn't be bad to make an item on archive.org with historical official dumps of scowiki though
15:02 🔗 Nemo_bis The media tarballs were not updated after 2012, of course https://archive.org/search.php?query=scowiki&and[]=collection%3A%22wikimedia-mediatar%22
15:05 🔗 Nemo_bis But for that I'd just recommend to archive https://download.kiwix.org/zim/wikipedia/wikipedia_sco_all_maxi_2020-08.zim
15:06 🔗 Nemo_bis On the other hand, if someone has plenty of bandwidth and disk space at hand, I would gladly accept help to update https://archive.org/details/wikimediacommons
15:25 🔗 fishbone Nemo_bis: thank you for the thorough reply, i'll have to look into it i guess - i have a feeling that things online started to get the axe way more frequently these days, sometimes with no prior notice where one would be expected
15:26 🔗 fishbone youtube quietly allowing certain channels to replace the videos while keeping the url, views, comments etc. comes to mind, but there's plenty of examples
15:27 🔗 fishbone i like the story of scots wiki, so any blatant libel and extra drama is part of the story
15:29 🔗 systwi_ has joined #wikiteam
15:30 🔗 fishbone i feel like i'm gonna fail to do anything actually useful, but if you have a link that explains what one needs to do to update https://archive.org/details/wikimediacommons i'm going to read it to see if i can help
15:34 🔗 systwi has quit IRC (Read error: Operation timed out)
15:55 🔗 Nemo_bis fishbone: Wikimedia and Google are pretty much the opposite, Wikimedia Foundation is nearly all free software and open data
15:55 🔗 Nemo_bis fishbone: I'm afraid we only have the code as explanation https://github.com/WikiTeam/wikiteam/tree/master/wikimediacommons https://wikitech.wikimedia.org/wiki/Nova_Resource:Dumps
15:56 🔗 Nemo_bis fishbone: Making an item to collect existing dumps and archives of sco.wikipedia.org is something you can easily do (even manually) and would be useful
15:57 🔗 Nemo_bis It will most likely be less than 5 GB even if you include several years
15:57 🔗 Nemo_bis Then you can advertise the torrent on Reddit for your 15 min of glory :)
15:57 🔗 Ryz has joined #wikiteam
16:35 🔗 fishbone honestly, having to post on reddit sounds like something i would use to discourage people :)) i'll check out your links, but please don't take it as a commitment
16:35 🔗 fishbone i move around a lot and also i'm, uuh, averse to work
16:35 🔗 fishbone have a nice friday evening everybody, i'm off for the day
16:35 🔗 fishbone thanks again
16:36 🔗 fishbone has quit IRC (Quit: Leaving...)
16:39 🔗 Nemo_bis Unbelievable, someone whose work cannot be bought with internet points! ;-)
17:44 🔗 JAA Nemo_bis: The official dump files are getting uploaded continuously to IA already as far as I can see. https://archive.org/search.php?query=scowiki&sort=-publicdate
17:47 🔗 JAA Or did you mean dumps from 2005 to 2011?
17:48 🔗 JAA Nope, there are also plenty of those (was sorting by the wrong date at first): https://archive.org/search.php?query=scowiki&sort=date
17:49 🔗 JAA Wikimedia Commons weren't updated since September 2016, am I seeing that right? I was under the impression that was still actively mirrored. Oof.
18:02 🔗 Nemo_bis JAA: Where "actively" means "whenever Nemo_bis gives up on waiting for someone else to step up"
18:02 🔗 JAA Ah, I thought someone had built some automated system that needs near-zero maintenance. Although this is AT, so obviously not. :-P
18:03 🔗 Nemo_bis Yes, there are many incremental dumps from across the years but these are all rather recent and most people won't have much use for them. So it can be interesting to have the most recent full dumps and something older from various places
18:03 🔗 Nemo_bis There's nothing automated because we cannot afford keeping big enough hard disks online all the time
18:04 🔗 JAA Looks like a couple hundred GiB for each daily dump. Is that new uploads per day?!
18:04 🔗 Nemo_bis Yes. Some can be much bigger though
18:04 🔗 JAA Damn.
18:05 🔗 Nemo_bis And yes, ideally we'd find some smarter way to partition them.
18:05 🔗 JAA Should be easy to split it into hourly dumps, no?
18:07 🔗 Nemo_bis I guess so. Although that's no guarantee either, when someone starts uploading millions of TIFF maps of innumerable megapixels
18:07 🔗 JAA I'm getting a vague feeling that we've had this discussion before.
18:07 🔗 Nemo_bis Somewhat unhelpfully, some users are now mirroring nearly all Internet Archive books on Wikimedia Commons, so this collection might need to be reassessed in the future
18:07 🔗 Nemo_bis Yes probably. :)
18:08 🔗 JAA Hah
18:08 🔗 JAA EFNet/#wikiteam 2020-01-01 17:46:10 UTC <@JAA> Nemo_bis: I just noticed that https://archive.org/details/wikimediacommons?sort=-publicdate hasn't been updated since 2016. How come? ...
18:08 🔗 JAA And it wasn't even that long ago.
