[00:01] *** Raccoon is now known as A-real-ni
[00:01] *** A-real-ni is now known as Raccoon
[00:01] *** Ajay19 has quit IRC (Client Quit)
[00:01] *** Ajay14 has joined #archiveteam-bs
[00:03] *** Ajay1 has quit IRC (Ping timeout: 265 seconds)
[00:03] *** Ajay14 is now known as Ajay1
[00:04] *** Ajay15 has joined #archiveteam-bs
[00:08] *** Ajay1 has quit IRC (Ping timeout: 265 seconds)
[00:08] *** Ajay15 is now known as Ajay1
[00:08] *** Ajay10 has joined #archiveteam-bs
[00:10] *** Ajay10 has quit IRC (Client Quit)
[00:11] *** Ajay10 has joined #archiveteam-bs
[00:13] *** Ajay1 has quit IRC (Ping timeout: 265 seconds)
[00:13] *** Ajay10 is now known as Ajay1
[00:14] *** Ajay14 has joined #archiveteam-bs
[00:15] *** Ajay14 is now known as Ajay
[00:16] *** Ajay has quit IRC (Client Quit)
[00:18] *** Ajay1 has quit IRC (Ping timeout: 265 seconds)
[00:24] *** jshoard has quit IRC (Quit: Leaving)
[01:18] *** RichardG_ has joined #archiveteam-bs
[01:24] *** RichardG has quit IRC (Read error: Operation timed out)
[01:28] *** lennier2 has joined #archiveteam-bs
[01:32] *** lennier1 has quit IRC (Ping timeout: 272 seconds)
[01:32] *** lennier2 is now known as lennier1
[01:52] <Ryz> ...So, I was going to archive each of those file downloads from https://d-indiegames.blogspot.com/ via AB - some had dead links, this one like https://d-indiegames.blogspot.com/2014/03/misao.html in particular, this link being http://www.mediafire.com/download/kba90hx4zrfkwa9/Misao.zip got DMCA'd by Nintendo...even though looking from the screenshots, there isn't anything Nintendo related at all
[01:54] <Arcorann_> https://vgperson.com/games/misao.htm
[01:55] <Arcorann_> That is weird, though
[02:00] <nico_32> https://web.archive.org/web/*/https://vgperson.com/games/Misao303.zip
[02:00] <nico_32> Tue, 07 Apr 2020 12:42:17 GMT (why: wikicollections, wikipedia-eventstream, wikipediaoutlinks)
[02:00] <nico_32> interesting way for IA to grab this file
[02:13] *** HP_Archiv has joined #archiveteam-bs
[02:23] *** Pixi has quit IRC (Read error: Connection reset by peer)
[02:24] <Ryz> There's this page https://d-indiegames.blogspot.com/2014/08/misao-version-3.html - and that download link is still up fortunately
[02:24] <Ryz> ...It's just really bizarre that somehow this was taken down by Nintendo :/
[02:29] *** Pixi has joined #archiveteam-bs
[03:17] *** HP_Archiv has quit IRC (Quit: Leaving)
[03:25] <Ryz> Interesting, this Tumblr account has it set up which https://terriball-tl.tumblr.com/ redirects to https://terriball-tl.tumblr.com/tagged/release
[03:26] *** qw3rty_ has joined #archiveteam-bs
[03:27] <OrIdow6> voltagex on the HN page: "A 2019 paper says there's 47TB of Docker images on the Hub. Get scraping."
[03:27] <OrIdow6> I've been meaning to contact voltagex about the Samsung VR thing anyway...
[03:28] <OrIdow6> That quote wasn't cited, obviously
[03:28] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[03:30] <OrIdow6> Wait, he's here
[03:33] *** qw3rty has quit IRC (Read error: Operation timed out)
[04:20] <OrIdow6> And apparently put the link in -ot
[05:18] *** phuzion has joined #archiveteam-bs
[05:19] *** phuzion_ has quit IRC (Read error: Connection reset by peer)
[05:21] *** Ctrl has quit IRC (Read error: Operation timed out)
[05:21] *** wessel152 has quit IRC (Read error: Operation timed out)
[05:22] *** Ctrl has joined #archiveteam-bs
[05:31] *** Ctrl has quit IRC (Read error: Operation timed out)
[05:32] *** legoktm has quit IRC (Read error: Connection reset by peer)
[05:32] *** BnAboyZ has quit IRC (Read error: Connection reset by peer)
[05:32] *** underscor has quit IRC (Read error: Connection reset by peer)
[05:32] *** underscor has joined #archiveteam-bs
[05:33] *** BnAboyZ has joined #archiveteam-bs
[05:34] *** legoktm has joined #archiveteam-bs
[05:43] *** Ctrl has joined #archiveteam-bs
[05:44] *** britmob_ has joined #archiveteam-bs
[05:45] *** Jake has quit IRC (Read error: Operation timed out)
[05:49] *** britmob has quit IRC (Read error: Operation timed out)
[05:54] *** Jake has joined #archiveteam-bs
[06:36] *** HP_Archiv has joined #archiveteam-bs
[07:16] *** HP_Archiv has quit IRC (Read error: Connection reset by peer)
[07:26] *** HP_Archiv has joined #archiveteam-bs
[07:55] *** user_ has joined #archiveteam-bs
[08:09] <user_> Hi, I have a couple of questions. I've partially archived a newspaper website (zviazda.by) with Wget and would like to upload the resulting .warc.gz data to the Internet Archive. Could you please advise:
[08:09] <user_> - How exactly can I tag my items with the subject keyword "archiveteam"? Are there any working examples for the "ia" command line tool?
[08:09] <user_> - A few .warc.gz blocks were lost because of low disk space on my machine. Is it still OK to upload the remaining blocks, or should I fully recrawl the site?
[08:09] <user_> Thanks in advance for any comments.
[08:13] *** kiska1825 has quit IRC (Read error: Operation timed out)
[08:14] *** Ryz has quit IRC (Quit: Ping timeout (120 seconds))
[08:16] *** jshoard has joined #archiveteam-bs
[08:17] *** Ryz has joined #archiveteam-bs
[08:17] *** kiska1825 has joined #archiveteam-bs
[08:18] *** svchfoo3 sets mode: +o Ryz
[08:59] <mgrandi> Can you recrawl the parts that were not saved?
[08:59] *** Jonimoose has quit IRC (Read error: Operation timed out)
[09:00] *** Jonimoose has joined #archiveteam-bs
[09:00] *** HP_Archiv has quit IRC (Quit: Leaving)
[09:02] <mgrandi> And I think you modify metadata either by having like the warc headers or using `ia metadata` https://archive.org/services/docs/api/internetarchive/cli.html#metadata
[09:04] <OrIdow6> user_: IIRC the proper set of arguments would be '-m "subject:Archiveteam"'; but as (unless you have a more familiar nick) no one here knows you, your warcs would not be added to the Wayback Machine, nor (I think) moved to the Archiveteam collection
[09:05] <OrIdow6> (You may be able to do without the quotes around the second argument depending on your shell, obviously)
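A minimal Python sketch of the tagging advice above, assuming the `ia` command-line tool from the internetarchive package is installed and configured. The item identifier and file name are hypothetical, and the helper only builds the argument list; it does not run the upload. Passing the arguments as a list means no shell quoting is needed around the metadata value.

```python
import shlex


def build_ia_upload_command(identifier, warc_paths, subject="archiveteam"):
    """Return the argv list for an `ia upload` call that sets a subject keyword."""
    if not warc_paths:
        raise ValueError("at least one file is required")
    # `ia upload <identifier> <file>... --metadata=<key:value>`
    return ["ia", "upload", identifier, *warc_paths, f"--metadata=subject:{subject}"]


if __name__ == "__main__":
    # Hypothetical identifier and file name, for illustration only.
    argv = build_ia_upload_command(
        "zviazda-2020-example", ["zviazda-2020.08.01-00000.warc.gz"]
    )
    # Print the command instead of executing it.
    print(shlex.join(argv))
```

The resulting list could be handed to `subprocess.run(argv, check=True)` once the identifier and files are real.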
[09:07] *** mgrandi has left 
[09:07] *** mgrandi has joined #archiveteam-bs
[09:09] *** VerifiedJ has joined #archiveteam-bs
[09:10] <user_> <mgrandi>: Yes, I'm going to recrawl unsaved parts, it will take several days. Basically it will be another set of .warc.gz blocks.
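One possible way to work out which URLs need recrawling, sketched in Python. It assumes the CDX index written by `--warc-cdx` survived, that its header line has the form ` CDX a b a m s k r M V g u` (with `a` as the original URL and `g` as the WARC file name), and that the names of the surviving WARC files are known. The function name and sample data are hypothetical.

```python
def urls_to_recrawl(wanted_urls, cdx_lines, surviving_warcs):
    """Return wanted URLs with no CDX entry pointing at a surviving WARC file."""
    lines = iter(cdx_lines)
    header = next(lines).split()
    if not header or header[0] != "CDX":
        raise ValueError("expected a CDX header line")
    fields = header[1:]
    url_col = fields.index("a")   # assumption: first "a" is the original URL
    file_col = fields.index("g")  # assumption: "g" is the WARC file name
    saved = set()
    for line in lines:
        parts = line.split()
        if len(parts) != len(fields):
            continue  # malformed row: treat the URL as not saved
        if parts[file_col] in surviving_warcs:
            saved.add(parts[url_col])
    # Keep the original order of the wanted list.
    return [url for url in wanted_urls if url not in saved]
```

Rows that do not match the header width are skipped, so the affected URLs end up in the recrawl list rather than being silently counted as saved.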
[09:10] <user_> <mgrandi>, <OrIdow6>: Thanks for these instructions. I don't have a more familiar nick, as I'm writing to this IRC channel for the first time. Could you please explain what details I should provide about myself to be able to add warcs to the ArchiveTeam collection?
[09:16] *** BnAboyZ has quit IRC (Read error: Connection reset by peer)
[09:18] *** BnAboyZ has joined #archiveteam-bs
[09:22] *** Ctrl has quit IRC (Read error: Operation timed out)
[09:22] *** BnAboyZ has quit IRC (Read error: Connection reset by peer)
[09:23] *** BnAboyZ has joined #archiveteam-bs
[09:23] *** kyledrake has quit IRC (Read error: Operation timed out)
[09:24] *** kyledrake has joined #archiveteam-bs
[09:25] *** Ctrl has joined #archiveteam-bs
[09:33] <mgrandi> @JAA: did anyone ever look at that hltv thing? Do we need something to make a url crawler script?
[09:33] <OrIdow6> user_: You can stay on until the right people get on who can really answer that, but in any case I don't think your chances are that good, especially with the compression thing
[09:35] <OrIdow6> And the wget issue with the brackets in the warc (where wget outputs a warc format slightly different from the de facto (but technically wrong) standard)
[09:35] <OrIdow6> Though maybe the WBM can get around that, I haven't been keeping track of it
[09:36] <Kaz> I don't have a working example for the ia upload tool, someone else might do though. As for the missing blocks, obviously a full recrawl would be ideal but if the warcs are still usable without it then that's fine (just obviously.. the data will be missing)
[09:36] *** betamax has quit IRC (Read error: Operation timed out)
[09:36] <Kaz> I will note that your warcs will *not* end up available in the wayback machine though
[09:43] *** pikami has quit IRC (Ping timeout: 615 seconds)
[09:43] *** wessel152 has joined #archiveteam-bs
[09:43] *** phuzion has quit IRC (Ping timeout: 615 seconds)
[09:43] *** phuzion has joined #archiveteam-bs
[09:46] <user_> <OrIdow6>, <Kaz>: Got it, thanks! As the website isn't dying as of yet, I'll do a recrawl. I'm aware of the brackets issue – would you please recommend a command line tool other than wget that I can use for recrawling? Basically I want to back up a list of URLs (news stories published by the website) without doing a full traversal, so I thought a feature-rich crawler like Heritrix may be overkill.
[09:46] *** pikami has joined #archiveteam-bs
[09:47] <Kaz> brackets issue? 
[09:47] *** betamax has joined #archiveteam-bs
[09:50] <user_> Discussed here: http://fileformats.archiveteam.org/wiki/WARC (Ctrl+F "wget")
[09:54] <Arcorann_> That's something I'm interested in looking into as well (saving the full contents of various blogs into the Wayback Machine, in my case)
[10:02] <Kaz> user_: we use https://github.com/ArchiveTeam/wget-lua for most/all of our projects, I'd say that's safe
[10:02] <Kaz> Arcorann_: archivebot is likely your best bet. You won't be able to upload to IA and have warcs ingested into the Wayback Machine on your own
[10:10] <user_> <Kaz>: Thank you for sharing this! Sorry for another stupid question, but does wget-lua accept the same command line parameters as GNU wget? I'm running the crawl like this:
[10:10] <user_> wget -e robots=off --page-requisites --waitretry 5 --timeout 60 --tries 5 --wait 2 --random-wait --warc-header "operator: <redacted>" --warc-cdx --warc-file="zviazda-2020.08.01" -U "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0" --input-file="links.txt" --reject-regex 'zviazda_theme/img/(blank|fancybox)' --warc-max-size=100M
[10:10] <user_> Will it also work with wget-lua? And should I change any of the arguments for compatibility with the existing practice?
[10:21] <mgrandi> I think wget Lua just adds additional arguments
[10:24] <mgrandi> It seems fine, you can see what the arguments are for projects:  https://github.com/ArchiveTeam/github-grab/blob/master/pipeline.py#L192
[10:24] *** recruit_m has joined #archiveteam-bs
[10:25] <recruit_m> @kaz are you here
[10:25] <Kaz> hello
[10:25] <Kaz> user_: I believe so yes
[10:25] <recruit_m> ok, thanks for the link
[10:26] <mgrandi> Also: does anyone know what the "new style hash" is for a CDX file? Is it base64 sha1? The documentation literally just says "new style hash"
[10:27] <mgrandi>  (The other style of metadata file format seems much more documented and easy to parse but I digress)
[10:31] <Kaz> cc JAA
[10:32] <user_> Great, thank you guys!
[11:02] *** godane has quit IRC (Ping timeout: 260 seconds)
[11:12] <JAA> mgrandi: I haven't looked at HLTV yet.
[11:12] <JAA> mgrandi: In my experience, the 'new style hash' in the CDX at least as written by IA is the same as recorded in the WARC, i.e. SHA-1 in base32.
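WARC digest headers are commonly written as `sha1:` followed by a Base32 (RFC 4648) encoding of the raw SHA-1 digest. A minimal Python sketch of computing a value in that style, assuming the input is the payload bytes being hashed; the function name is hypothetical and this is not taken from any particular tool's source.

```python
import base64
import hashlib


def sha1_base32(payload: bytes) -> str:
    """Return the SHA-1 of payload as a 32-character Base32 string."""
    digest = hashlib.sha1(payload).digest()  # 20 raw bytes
    # 160 bits divide evenly into 5-bit groups, so no "=" padding appears.
    return base64.b32encode(digest).decode("ascii")


if __name__ == "__main__":
    print(sha1_base32(b"example payload"))
```

Comparing this string against the checksum column of a CDX line is one way to check whether the index and the WARC record agree.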
[11:21] <recruit_m> so Jason Scott (from the twitter link at the top of this channel) does a paid podcast? how does that add up? does someone have links for free or a comment on how you can be an archivist and put content behind paywalls?
[11:25] *** godane has joined #archiveteam-bs
[11:33] *** omglolba- has joined #archiveteam-bs
[11:33] *** omglolbah has quit IRC (Read error: Connection reset by peer)
[11:40] <Arcorann_> Kaz: thanks for the response. Are there any specific requirements for using archivebot?
[11:41] <Kaz> recruit_m: we're archivists, not communists
[11:41] <Kaz> I mean I'm sure there's a few, but as a whole nobody's got an issue with paywalls existing
[11:42] <recruit_m> ok
[11:42] <recruit_m> when will the podcast be archived :^
[11:43] <Kaz> Arcorann_: 'read the docs' basically. I don't use it much so there's definitely people that know more about the day-to-day
[11:43] <Arcorann_> I see
[11:43] <Kaz> drop into #archivebot if you're not already
[11:43] <recruit_m> done
[11:45] <Kaz> recruit_m: i would imagine it's already archived somewhere
[11:45] <Kaz> that doesn't mean it's *public*
[11:45] <recruit_m> interesting
[11:46] <recruit_m> so how do open source/culture and archiv(ism?) relate?
[11:47] <Kaz> well, it's certainly easier to archive things that are already public / open source
[11:50] <recruit_m> ok, thanks
[11:51] <recruit_m> I am new and exploring, at work, are there network or capacity overload risks I may bump into? 
[11:56] *** trc has quit IRC (Quit: Goodbye)
[13:14] *** Arcorann_ is now known as Arcorann
[14:08] *** fredgido has quit IRC (Leaving)
[14:54] *** recruit_m has quit IRC (Ping timeout: 252 seconds)
[15:05] *** Ctrl has quit IRC (Read error: Operation timed out)
[15:16] *** Nikchemny has joined #archiveteam-bs
[15:22] <Nikchemny> JAA: As far as I know, AT saves flash games, right? https://vk.com/dev/no_flash and https://vk.com/dev/no_flash_2.0 may be interesting. The games are only for users, so AT must have account(s) on Vk
[15:23] <Nikchemny> Kaz VoynichCr maybe SketchCow
[15:25] *** Ctrl has joined #archiveteam-bs
[15:34] <Nikchemny> I mean, all the games were before 2018
[15:36] *** Nikchemny has quit IRC (Quit: Page closed)
[15:36] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[16:18] *** Doran has joined #archiveteam-bs
[16:25] *** Doranwen has quit IRC (Read error: Operation timed out)
[16:29] *** Doran has quit IRC (Ping timeout: 272 seconds)
[16:29] *** Doran has joined #archiveteam-bs
[16:41] *** Raccoon has quit IRC (Ping timeout: 272 seconds)
[17:13] *** britmob_ has quit IRC (Read error: Connection reset by peer)
[17:15] *** britmob has joined #archiveteam-bs
[17:38] <JAA> Nikchemny: Flashpoint is the project you're looking for I think. We haven't really done much Flash stuff here as far as I know.
[17:42] *** OrIdow6^2 has joined #archiveteam-bs
[17:44] *** OrIdow6 has quit IRC (Ping timeout: 265 seconds)
[17:44] <SketchCow> OK, don't do that.
[17:45] <SketchCow> Who... the fuck.... is recruit_m
[17:46] <SketchCow> Like, did he just wander into the channel to go "So.... Jason has a paid podcast... explain THAT"
[17:46] <SketchCow> But, just for the future, although whatever:
[17:46] <SketchCow> - Podcast episodes are released on Patreon
[17:46] <SketchCow> - Anywhere from 2 weeks to a month after, they are released on youtube, apple podcasts, google podcasts, libsyn and a couple others I don't know. For free.
[17:47] <SketchCow> - Episodes are all archived at archive.org in a collection
[17:47] <SketchCow> - Fuck you
[17:48] <JAA> SketchCow: But don't you know that all content from archivists is required to be in the public domain immediately‽
[17:49] <SketchCow> Like, what even the fuck WAS that
[17:51] <SketchCow> Does remind me to release an episode though
[17:52] <SketchCow> Also I see the archiveteam inbox on IA needs me to write more sorting routines
[18:03] <lennier1> I've heard rumors people here have jobs writing closed source software. :)
[18:10] <JAA> Whaaaaa
[18:10] <JAA> Mind blown.
[18:35] <SketchCow> The entire POINT of archiveteam is all the compromises that have been made and trying to work with them
[18:35] <SketchCow> The deriver queue is finally going down on IA
[18:35] <SketchCow> https://archive.org/~tracey/mrtg/derivesg.html
[18:35] <SketchCow> It is merely the worst it has been in a year instead of the worst in a decade
[20:20] *** jshoard_ has joined #archiveteam-bs
[20:21] *** chfoo has quit IRC (Read error: Operation timed out)
[20:21] *** chfoo has joined #archiveteam-bs
[20:21] *** pikami has quit IRC (Read error: Operation timed out)
[20:22] *** sHATNER has quit IRC (Read error: Operation timed out)
[20:22] *** sHATNER has joined #archiveteam-bs
[20:22] *** thejsa has quit IRC (Read error: Operation timed out)
[20:22] *** svchfoo3 sets mode: +o chfoo
[20:22] *** Jon has quit IRC (Write error: Broken pipe)
[20:22] *** jshoard has quit IRC (Write error: Broken pipe)
[20:22] *** nico_32 has quit IRC (Read error: Operation timed out)
[20:22] *** Jon has joined #archiveteam-bs
[20:22] *** nico_32 has joined #archiveteam-bs
[20:22] *** wessel152 has quit IRC (Write error: Broken pipe)
[20:23] *** pikami has joined #archiveteam-bs
[20:23] *** kisspunch has quit IRC (Read error: Operation timed out)
[20:23] *** kisspunch has joined #archiveteam-bs
[20:24] *** VerifiedJ has quit IRC (Read error: Operation timed out)
[20:25] *** MillerBOS has quit IRC (Read error: Operation timed out)
[20:25] *** kiskaWee has quit IRC (Read error: Operation timed out)
[20:26] *** Yurume has quit IRC (Read error: Operation timed out)
[20:26] *** Ctrl has quit IRC (Read error: Operation timed out)
[20:27] *** Yurume has joined #archiveteam-bs
[20:30] *** VerifiedJ has joined #archiveteam-bs
[20:30] *** MillerBOS has joined #archiveteam-bs
[20:31] *** kiskaWee has joined #archiveteam-bs
[20:32] *** thejsa has joined #archiveteam-bs
[20:32] *** BnAboyZ has quit IRC (Read error: Connection reset by peer)
[20:32] *** VerifiedJ has quit IRC (Client Quit)
[20:33] *** Ctrl has joined #archiveteam-bs
[20:35] *** BnAboyZ has joined #archiveteam-bs
[20:51] <JAA> What was the peak queue size at the worst in a decade?
[20:52] <JAA> Also, thanks for the URL. I had been looking for those graphs before and couldn't find them anymore.
[21:02] <Ryz> Is that a good thing? Like the backlog of stuff to process is getting smaller and more manageable? o:
[21:04] <JAA> Yep
[21:39] <Ryz> To clarify, the content being processed is more than just AB's stuff?
[21:39] <JAA> Lots more I'm sure.
[21:41] <Ryz> What are the effects of this being lowered over time? Like content would appear in IA or WBM faster?
[21:41] <JAA> Yup
[21:41] <JAA> Derives have been delayed for a week or more from what I've heard.
[21:42] <Ryz> Ooo, instead of 2 days, it could be 1 day or maybe less~
[21:42] <Ryz> Oh, it was much longer, welp
[21:42] <JAA> That should go back down to reasonable times soon.
[22:22] *** wyatt8740 has joined #archiveteam-bs
[22:24] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[22:32] *** trc has joined #archiveteam-bs
[22:39] *** wyatt8740 has joined #archiveteam-bs
[22:41] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[22:42] *** wyatt8740 has joined #archiveteam-bs
[22:50] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[22:51] *** wyatt8740 has joined #archiveteam-bs
[22:55] *** BlueMax has joined #archiveteam-bs
[23:03] <Ryz> So, I was trying to find archives of Prevention magazine ( https://en.wikipedia.org/wiki/Prevention_(magazine) ) and Google Books has a small selection of stuff, I browsed one of 'em because I got the small magazines from my parents, thinking, this could be a sufficient replacement so I don't have to manually mine for website links;
[23:03] <Ryz> ...And then this came up: https://books.google.com/books?id=9MYDAAAAMBAJ&lpg=PP1&pg=PA56#v=onepage&q&f=false
[23:04] <Ryz> Not only is this a crappy scan, but it doesn't match the ad from the magazine that I have at all
[23:06] *** wessel152 has joined #archiveteam-bs
[23:07] *** Raccoon has joined #archiveteam-bs
[23:08] <Ryz> Has there been an initiative for bots or people to scan through documents like magazines and mine out links that may or may not exist anymore? I managed to mine at least 1 website that I threw into AB, which is https://www.videoeye.com/ - which came from an advertisement
[23:12] *** Raccoon` has joined #archiveteam-bs
[23:14] *** Raccoon has quit IRC (Ping timeout: 376 seconds)
[23:14] *** Raccoon` is now known as Raccoon
[23:27] *** Raccoon` has joined #archiveteam-bs
[23:28] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[23:31] *** Raccoon has quit IRC (Ping timeout: 272 seconds)
[23:32] *** Raccoon` has quit IRC (Ping timeout: 265 seconds)
[23:32] *** wyatt8740 has joined #archiveteam-bs
[23:32] *** Raccoon has joined #archiveteam-bs
[23:46] *** jshoard_ has quit IRC (Leaving)