[00:01] *** Raccoon is now known as A-real-ni [00:01] *** A-real-ni is now known as Raccoon [00:01] *** Ajay19 has quit IRC (Client Quit) [00:01] *** Ajay14 has joined #archiveteam-bs [00:03] *** Ajay1 has quit IRC (Ping timeout: 265 seconds) [00:03] *** Ajay14 is now known as Ajay1 [00:04] *** Ajay15 has joined #archiveteam-bs [00:08] *** Ajay1 has quit IRC (Ping timeout: 265 seconds) [00:08] *** Ajay15 is now known as Ajay1 [00:08] *** Ajay10 has joined #archiveteam-bs [00:10] *** Ajay10 has quit IRC (Client Quit) [00:11] *** Ajay10 has joined #archiveteam-bs [00:13] *** Ajay1 has quit IRC (Ping timeout: 265 seconds) [00:13] *** Ajay10 is now known as Ajay1 [00:14] *** Ajay14 has joined #archiveteam-bs [00:15] *** Ajay14 is now known as Ajay [00:16] *** Ajay has quit IRC (Client Quit) [00:18] *** Ajay1 has quit IRC (Ping timeout: 265 seconds) [00:24] *** jshoard has quit IRC (Quit: Leaving) [01:18] *** RichardG_ has joined #archiveteam-bs [01:24] *** RichardG has quit IRC (Read error: Operation timed out) [01:28] *** lennier2 has joined #archiveteam-bs [01:32] *** lennier1 has quit IRC (Ping timeout: 272 seconds) [01:32] *** lennier2 is now known as lennier1 [01:52] ...So, I was going to archive each of those file downloads from https://d-indiegames.blogspot.com/ via AB - some had dead links, this one like https://d-indiegames.blogspot.com/2014/03/misao.html in particular, this link being http://www.mediafire.com/download/kba90hx4zrfkwa9/Misao.zip got DMCA'd by Nintendo...even though looking from the screensho [01:52] ts, there isn't anything Nintendo related at all [01:54] https://vgperson.com/games/misao.htm [01:55] That is weird, though [02:00] https://web.archive.org/web/*/https://vgperson.com/games/Misao303.zip [02:00] Tue, 07 Apr 2020 12:42:17 GMT (why: wikicollections, wikipedia-eventstream, wikipediaoutlinks) [02:00] interesting way for IA to grab this file [02:13] *** HP_Archiv has joined #archiveteam-bs [02:23] *** Pixi has quit IRC (Read error: 
Connection reset by peer) [02:24] There's this page https://d-indiegames.blogspot.com/2014/08/misao-version-3.html - and that download link is still up fortunately [02:24] ...It's just really bizarre that somehow this was taken down by Nintendo :/ [02:29] *** Pixi has joined #archiveteam-bs [03:17] *** HP_Archiv has quit IRC (Quit: Leaving) [03:25] Interesting, this Tumblr account has it set up which https://terriball-tl.tumblr.com/ redirects to https://terriball-tl.tumblr.com/tagged/release [03:26] *** qw3rty_ has joined #archiveteam-bs [03:27] voltagex on the HN page: "A 2019 paper says there's 47TB of Docker images on the Hub. Get scraping." [03:27] I've been meaning to contact voltagex about the Samsung VR thing anyway... [03:28] That quote wasn't cited, obviously [03:28] *** wyatt8740 has quit IRC (Read error: Operation timed out) [03:30] Wait, he's here [03:33] *** qw3rty has quit IRC (Read error: Operation timed out) [04:20] And apparently put the link in -ot [05:18] *** phuzion has joined #archiveteam-bs [05:19] *** phuzion_ has quit IRC (Read error: Connection reset by peer) [05:21] *** Ctrl has quit IRC (Read error: Operation timed out) [05:21] *** wessel152 has quit IRC (Read error: Operation timed out) [05:22] *** Ctrl has joined #archiveteam-bs [05:31] *** Ctrl has quit IRC (Read error: Operation timed out) [05:32] *** legoktm has quit IRC (Read error: Connection reset by peer) [05:32] *** BnAboyZ has quit IRC (Read error: Connection reset by peer) [05:32] *** underscor has quit IRC (Read error: Connection reset by peer) [05:32] *** underscor has joined #archiveteam-bs [05:33] *** BnAboyZ has joined #archiveteam-bs [05:34] *** legoktm has joined #archiveteam-bs [05:43] *** Ctrl has joined #archiveteam-bs [05:44] *** britmob_ has joined #archiveteam-bs [05:45] *** Jake has quit IRC (Read error: Operation timed out) [05:49] *** britmob has quit IRC (Read error: Operation timed out) [05:54] *** Jake has joined #archiveteam-bs [06:36] *** HP_Archiv has 
joined #archiveteam-bs [07:16] *** HP_Archiv has quit IRC (Read error: Connection reset by peer) [07:26] *** HP_Archiv has joined #archiveteam-bs [07:55] *** user_ has joined #archiveteam-bs [08:09] Hi, I have a couple of questions. I've partially archived a newspaper website (zviazda.by) with Wget and would like to upload the resulting .warc.gz data to the Internet Archive. Could you please advise: [08:09] - How exactly can I tag my items with the subject keyword "archiveteam"? Are there any working examples for the "ia" command line tool? [08:09] - A few .warc.gz blocks were lost because of low disk space on my machine. Is it still OK to upload the remaining blocks, or should I fully recrawl the site? [08:09] Thanks in advance for any comments. [08:13] *** kiska1825 has quit IRC (Read error: Operation timed out) [08:14] *** Ryz has quit IRC (Quit: Ping timeout (120 seconds)) [08:16] *** jshoard has joined #archiveteam-bs [08:17] *** Ryz has joined #archiveteam-bs [08:17] *** kiska1825 has joined #archiveteam-bs [08:18] *** svchfoo3 sets mode: +o Ryz [08:59] Can you recrawl the parts that were not saved? 
[08:59] *** Jonimoose has quit IRC (Read error: Operation timed out) [09:00] *** Jonimoose has joined #archiveteam-bs [09:00] *** HP_Archiv has quit IRC (Quit: Leaving) [09:02] And I think you modify metadata either by having like the warc headers or using `ia metadata` https://archive.org/services/docs/api/internetarchive/cli.html#metadata [09:04] user_: IIRC the proper set of arguments would be '-m "subject:Archiveteam"'; but as (unless you have a more familiar nick) no one here knows you, your warcs would not be added to the Wayback Machine, nor (I think) moved to the Archiveteam collection [09:05] (You may be able to do without the quotes around the second argument depending on your shell, obviously) [09:07] *** mgrandi has left [09:07] *** mgrandi has joined #archiveteam-bs [09:09] *** VerifiedJ has joined #archiveteam-bs [09:10] : Yes, I'm going to recrawl unsaved parts, it will take several days. Basically it will be another set of .warc.gz blocks. [09:10] , : Thanks for these instructions. I don't have a more familiar nick, as I'm writing to this IRC channel for the first time. Could you please explain what details should I provide about myself to be able to add warcs to the ArchiveTeam collection? [09:16] *** BnAboyZ has quit IRC (Read error: Connection reset by peer) [09:18] *** BnAboyZ has joined #archiveteam-bs [09:22] *** Ctrl has quit IRC (Read error: Operation timed out) [09:22] *** BnAboyZ has quit IRC (Read error: Connection reset by peer) [09:23] *** BnAboyZ has joined #archiveteam-bs [09:23] *** kyledrake has quit IRC (Read error: Operation timed out) [09:24] *** kyledrake has joined #archiveteam-bs [09:25] *** Ctrl has joined #archiveteam-bs [09:33] @JAA: did anyone ever look at that hltv thing? Do we need something to make a url crawler script? 
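For the "working example" asked about at 08:09, here is a minimal sketch of the command lines that the `-m "subject:Archiveteam"` advice above corresponds to. It only builds and prints the commands and does not run them. The item identifier and file name are placeholders invented for illustration, and the helper function names are hypothetical. The `--metadata=` and `--modify=` long options are the forms given in the `ia` CLI documentation linked above.

```python
import shlex


def ia_upload_command(identifier, files, subject="archiveteam"):
    """Build (but do not run) an `ia upload` command that sets a subject keyword."""
    # Documented form: ia upload <identifier> <file>... --metadata=<key:value>
    return ["ia", "upload", identifier, *files, f"--metadata=subject:{subject}"]


def ia_modify_command(identifier, subject="archiveteam"):
    """Build (but do not run) an `ia metadata` command for an existing item."""
    # Documented form: ia metadata <identifier> --modify=<key:value>
    return ["ia", "metadata", identifier, f"--modify=subject:{subject}"]


if __name__ == "__main__":
    # "example-item" and "example-00000.warc.gz" are placeholders, not real names.
    print(shlex.join(ia_upload_command("example-item", ["example-00000.warc.gz"])))
    print(shlex.join(ia_modify_command("example-item")))
```

Setting the keyword this way only affects the item's own metadata. As noted in the channel, it does not by itself move the item into a collection or into the Wayback Machine.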
[09:33] user_: You can stay on until the right people get on who can really answer that, but in any case I don't think your chances are that good, especially with the compression thing [09:35] And the wget issue with the brackets in the warc (where wget outputs a warc format slightly different from the de facto (but technically wrong) standard) [09:35] Though maybe the WBM can get around that, I haven't been keeping track of it [09:36] I don't have a working example for the ia upload tool, someone else might do though. As for the missing blocks, obviously a full recrawl would be ideal but if the warcs are still usable without it then that's fine (just obviously.. the data will be missing) [09:36] *** betamax has quit IRC (Read error: Operation timed out) [09:36] I will note that your warcs will *not* end up available in the wayback machine though [09:43] *** pikami has quit IRC (Ping timeout: 615 seconds) [09:43] *** wessel152 has joined #archiveteam-bs [09:43] *** phuzion has quit IRC (Ping timeout: 615 seconds) [09:43] *** phuzion has joined #archiveteam-bs [09:46] , : Got it, thanks! As the website isn't dying as of yet, I'll do a recrawl. I'm aware of the brackets issue – would you please recommend a command line tool other than wget that I can use for recrawling? Basically I want to back up a list of URLs (news stories published by the website) without doing a full traversal, so I thought a feature-rich crawler like Heritrix may be overkill. [09:46] *** pikami has joined #archiveteam-bs [09:47] brackets issue? [09:47] *** betamax has joined #archiveteam-bs [09:50] Discussed here: http://fileformats.archiveteam.org/wiki/WARC (Ctrl+F "wget") [09:54] That's something I'm interested in looking into as well (saving the full contents of various blogs into the Wayback Machine, in my case) [10:02] user_: we use https://github.com/ArchiveTeam/wget-lua for most/all of our projects, I'd say that's safe [10:02] Arcorann_: archivebot is likely your best bet.
You won't be able to upload to IA and have warcs ingested into the IA on your own [10:10] : Thank you for sharing this! Sorry for another stupid question, but does wget-lua accept the same command line parameters as GNU wget? I'm running the crawl like this: [10:10] wget -e robots=off --page-requisites --waitretry 5 --timeout 60 --tries 5 --wait 2 --random-wait --warc-header "operator: " --warc-cdx --warc-file="zviazda-2020.08.01" -U "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0" --input-file="links.txt" --reject-regex 'zviazda_theme/img/(blank|fancybox)' --warc-max-size=100M [10:10] Will it also work with wget-lua? And should I change any of the arguments for compatibility with the existing practice? [10:21] I think wget-lua just adds additional arguments [10:24] It seems fine, you can see what the arguments are for projects: https://github.com/ArchiveTeam/github-grab/blob/master/pipeline.py#L192 [10:24] *** recruit_m has joined #archiveteam-bs [10:25] @kaz are you here [10:25] hello [10:25] user_: I believe so yes [10:25] ok, thanks for the link [10:26] Also: does anyone know what the "new style hash" is for a CDX file? Is it base64 sha1? The documentation literally just says "new style hash" [10:27] (The other style of metadata file format seems much more documented and easy to parse but I digress) [10:31] cc JAA [10:32] Great, thank you guys! [11:02] *** godane has quit IRC (Ping timeout: 260 seconds) [11:12] mgrandi: I haven't looked at HLTV yet. [11:12] mgrandi: In my experience, the 'new style hash' in the CDX at least as written by IA is the same as recorded in the WARC, i.e. SHA-1 in base32. [11:21] so Jason Scott (from the twitter link at the top of this channel) does a paid podcast? how does that add up? does someone have links for free or a comment on how you can be an archivist and put content behind paywalls?
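On the 10:26 question about the CDX "new style hash": the digest conventionally recorded in WARC files is a SHA-1 encoded in base32, and a value in that style can be reproduced as sketched below. This is a minimal illustration, not IA's actual tooling. The function name is hypothetical, and which bytes get hashed (payload versus the whole record block) depends on the WARC header being compared against.

```python
import base64
import hashlib


def warc_style_sha1(payload: bytes) -> str:
    """Return the SHA-1 of `payload` as a 32-character base32 string."""
    # A SHA-1 digest is 20 bytes (160 bits), which base32 encodes as
    # exactly 32 characters with no '=' padding.
    digest = hashlib.sha1(payload).digest()
    return base64.b32encode(digest).decode("ascii")


if __name__ == "__main__":
    # Digest of an empty payload, as an easy value to compare by eye.
    print(warc_style_sha1(b""))
```

To check a CDX entry against a file, hash the same bytes the WARC writer hashed and compare the resulting string with the digest field.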
[11:25] *** godane has joined #archiveteam-bs [11:33] *** omglolba- has joined #archiveteam-bs [11:33] *** omglolbah has quit IRC (Read error: Connection reset by peer) [11:40] Kaz: thanks for the response. Are there any specific requirements for using archivebot? [11:41] recruit_m: we're archivists, not communists [11:41] I mean I'm sure there's a few, but as a whole nobody's got an issue with paywalls existing [11:42] ok [11:42] when will the podcast be archived :^ [11:43] Arcorann_: 'read the docs' basically. I don't use it much so there's definitely people that know more about the day-to-day [11:43] I see [11:43] drop into #archivebot if you're not already [11:43] done [11:45] recruit_m: i would imagine it's already archived somewhere [11:45] that doesn't mean it's *public* [11:45] interesting [11:46] so how do open source/culture and archiv(ism?) relate? [11:47] well, it's certainly easier to archive things that are already public / open source [11:50] ok, thanks [11:51] I am new and exploring, at work, are there network or capacity overload risks I may bump into? [11:56] *** trc has quit IRC (Quit: Goodbye) [13:14] *** Arcorann_ is now known as Arcorann [14:08] *** fredgido has quit IRC (Leaving) [14:54] *** recruit_m has quit IRC (Ping timeout: 252 seconds) [15:05] *** Ctrl has quit IRC (Read error: Operation timed out) [15:16] *** Nikchemny has joined #archiveteam-bs [15:22] JAA: As I know, AT saves flash games, right? https://vk.com/dev/no_flash and https://vk.com/dev/no_flash_2.0 may be interesting. 
The games are only for users, so AT must have account(s) on Vk [15:23] Kaz VoynichCr maybe SketchCow [15:25] *** Ctrl has joined #archiveteam-bs [15:34] I mean, all the games were before 2018 [15:36] *** Nikchemny has quit IRC (Quit: Page closed) [15:36] *** Arcorann has quit IRC (Read error: Connection reset by peer) [16:18] *** Doran has joined #archiveteam-bs [16:25] *** Doranwen has quit IRC (Read error: Operation timed out) [16:29] *** Doran has quit IRC (Ping timeout: 272 seconds) [16:29] *** Doran has joined #archiveteam-bs [16:41] *** Raccoon has quit IRC (Ping timeout: 272 seconds) [17:13] *** britmob_ has quit IRC (Read error: Connection reset by peer) [17:15] *** britmob has joined #archiveteam-bs [17:38] Nikchemny: Flashpoint is the project you're looking for I think. We haven't really done much Flash stuff here as far as I know. [17:42] *** OrIdow6^2 has joined #archiveteam-bs [17:44] *** OrIdow6 has quit IRC (Ping timeout: 265 seconds) [17:44] OK, don't do that. [17:45] Who... the fuck.... is recruit_m [17:46] Like, did he just wander into the channel to go "So.... Jason has a paid podcast... explain THAT" [17:46] But, just for the future, although whatever: [17:46] - Podcast episodes are released on Patreon [17:46] - Anywhere from 2 weeks to a month after they are released on youtube, apple podcasts, google podcasts, libsyn and a couple others I don't know. For free. [17:47] - Episodes are all archived at archive.org in a collection [17:47] - Fuck you [17:48] SketchCow: But don't you know that all content from archivists is required to be in the public domain immediately‽ [17:49] Like, what even the fuck WAS that [17:51] Does remind me to release an episode though [17:52] Also I see the archiveteam inbox on IA needs me to write more sorting routines [18:03] I've heard rumors people here have jobs writing closed source software. :) [18:10] Whaaaaa [18:10] Mind blown. 
[18:35] The entire POINT of archiveteam is all the compromises that have been made and trying to work with them [18:35] The deriver queue is finally going down on IA [18:35] https://archive.org/~tracey/mrtg/derivesg.html [18:35] It is merely the worst it has been in a year instead of the worst in a decade [20:20] *** jshoard_ has joined #archiveteam-bs [20:21] *** chfoo has quit IRC (Read error: Operation timed out) [20:21] *** chfoo has joined #archiveteam-bs [20:21] *** pikami has quit IRC (Read error: Operation timed out) [20:22] *** sHATNER has quit IRC (Read error: Operation timed out) [20:22] *** sHATNER has joined #archiveteam-bs [20:22] *** thejsa has quit IRC (Read error: Operation timed out) [20:22] *** svchfoo3 sets mode: +o chfoo [20:22] *** Jon has quit IRC (Write error: Broken pipe) [20:22] *** jshoard has quit IRC (Write error: Broken pipe) [20:22] *** nico_32 has quit IRC (Read error: Operation timed out) [20:22] *** Jon has joined #archiveteam-bs [20:22] *** nico_32 has joined #archiveteam-bs [20:22] *** wessel152 has quit IRC (Write error: Broken pipe) [20:23] *** pikami has joined #archiveteam-bs [20:23] *** kisspunch has quit IRC (Read error: Operation timed out) [20:23] *** kisspunch has joined #archiveteam-bs [20:24] *** VerifiedJ has quit IRC (Read error: Operation timed out) [20:25] *** MillerBOS has quit IRC (Read error: Operation timed out) [20:25] *** kiskaWee has quit IRC (Read error: Operation timed out) [20:26] *** Yurume has quit IRC (Read error: Operation timed out) [20:26] *** Ctrl has quit IRC (Read error: Operation timed out) [20:27] *** Yurume has joined #archiveteam-bs [20:30] *** VerifiedJ has joined #archiveteam-bs [20:30] *** MillerBOS has joined #archiveteam-bs [20:31] *** kiskaWee has joined #archiveteam-bs [20:32] *** thejsa has joined #archiveteam-bs [20:32] *** BnAboyZ has quit IRC (Read error: Connection reset by peer) [20:32] *** VerifiedJ has quit IRC (Client Quit) [20:33] *** Ctrl has joined #archiveteam-bs [20:35] 
*** BnAboyZ has joined #archiveteam-bs [20:51] What was the peak queue size at the worst in a decade? [20:52] Also, thanks for the URL. I had been looking for those graphs before and couldn't find them anymore. [21:02] Is that a good thing? Like the backlog of stuff to process is getting smaller and more manageable? o: [21:04] Yep [21:39] The contents being processed is more than just AB's stuff to clarify? [21:39] Lots more I'm sure. [21:41] What are the effects of this being lowered down over time? Like content would appear in IA or WBM faster? [21:41] Yup [21:41] Derives have been delayed for a week or more from what I've heard. [21:42] Ooo, instead of 2 days, it could be 1 day or maybe less~ [21:42] Oh, it was much longer, welp [21:42] That should go back down to reasonable times soon. [22:22] *** wyatt8740 has joined #archiveteam-bs [22:24] *** wyatt8740 has quit IRC (Read error: Operation timed out) [22:32] *** trc has joined #archiveteam-bs [22:39] *** wyatt8740 has joined #archiveteam-bs [22:41] *** wyatt8740 has quit IRC (Read error: Operation timed out) [22:42] *** wyatt8740 has joined #archiveteam-bs [22:50] *** wyatt8740 has quit IRC (Read error: Operation timed out) [22:51] *** wyatt8740 has joined #archiveteam-bs [22:55] *** BlueMax has joined #archiveteam-bs [23:03] So, I was trying to find archives of Prevention magazine ( https://en.wikipedia.org/wiki/Prevention_(magazine) ) and Google Books has a small selection of stuff, I browsed one of 'em because I got the small magazines from my parents, thinking, this could be a sufficient replacement so I don't have to manually mine for website links; [23:03] ...And then this came up: https://books.google.com/books?id=9MYDAAAAMBAJ&lpg=PP1&pg=PA56#v=onepage&q&f=false [23:04] Not only is this a crappy scan, but it doesn't match the ad from the magazine that I have at all [23:06] *** wessel152 has joined #archiveteam-bs [23:07] *** Raccoon has joined #archiveteam-bs [23:08] Has there been an initiative for
bots or people to scan through documents like magazines and mine out links that may or may not exist anymore? I managed to mine at least 1 website that I threw into AB, which is https://www.videoeye.com/ - which came from an advertisement [23:12] *** Raccoon` has joined #archiveteam-bs [23:14] *** Raccoon has quit IRC (Ping timeout: 376 seconds) [23:14] *** Raccoon` is now known as Raccoon [23:27] *** Raccoon` has joined #archiveteam-bs [23:28] *** wyatt8740 has quit IRC (Read error: Operation timed out) [23:31] *** Raccoon has quit IRC (Ping timeout: 272 seconds) [23:32] *** Raccoon` has quit IRC (Ping timeout: 265 seconds) [23:32] *** wyatt8740 has joined #archiveteam-bs [23:32] *** Raccoon has joined #archiveteam-bs [23:46] *** jshoard_ has quit IRC (Leaving)