[05:48] Hello, everyone. Someone told me about you guys - I have been archiving TikTok and heard that you may be interested.
[06:12] Someone was talking about that site the other day. kpcyrd ^
[06:20] So what's the storage consumed by 23.5M videos?
[06:26] ~55 TB
[06:28] wait, nvm, 64 TB
[06:45] You have 64 TB of videos?
[06:49] yes
[07:22] I hadn't looked into the site until literally just now. Looks like the IDs are really large numbers. How did you do discovery? If you can stick around for a few hours, others with more informed questions will start strolling in for the day.
[07:25] I'm PST so I'll be going to bed very shortly.
[07:25] >How did you do discovery?
[07:25] I reverse-engineered the web API (the mobile API proved to be mostly too difficult).
[07:28] I have not found any correlation between the video IDs and anything I could think might be correlated (user ID, time of upload, etc.). That could be solved with ML, but I didn't think it was worth it when the API just gives you valid video IDs.
[07:28] Something to note about video ID URLs: here's an example.
[07:29] https://www.tiktok.com/@turkeycomedy/video/6758661713001745670?langCountry=en
[07:30] This is not valid. The video "6758661713001745670" does not belong to @turkeycomedy; it actually belongs to @nicoleutyro.
[07:30] But the URL will resolve anyway if the ID is valid.
[07:36] yeah, US Pacific, the last holdouts. What percentage do you estimate 23.5 million covers?
[07:38] I've heard of people decompiling other mobile apps to get API keys, but if there's a web site, I agree it's not as worth it.
[07:39] I don't know the answer to that question. It's 146k individual users. I (ab)use the "recommended" feature, which I'm assuming uses some sort of hand-wavy algorithm to give "similar" users, and I maybe only get 10 or so new users a day.
[07:40] Decompiling the app to get an API key is not possible in this case.
[07:40] It uses a time-based key ("X-Gorgon") in the header to verify all API requests. It does some bitwise manipulation on the current unix timestamp in ms and some startup params, but I was never able to get farther than that.
[07:45] fancy. this sounds interesting. if we get more interest we'll form a dedicated channel. There's a wiki page, but it's just been started: https://www.archiveteam.org/index.php?title=TikTok
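The ID-versus-username behaviour described above is easy to check. Below is a minimal sketch, not the archiver's actual code: it requests the same video ID under two different usernames and compares where TikTok resolves each page. The canonical-link regex is an assumption about the page markup, with the final redirect URL as fallback, and TikTok may block non-browser clients.

    # Sketch only: request the same video ID under two different usernames
    # and see where TikTok resolves each page.
    import re
    import requests

    VIDEO_ID = "6758661713001745670"   # the ID quoted above
    HEADERS = {"User-Agent": "Mozilla/5.0"}

    def resolves_to(username: str) -> str:
        """Fetch /@username/video/<id> and report the canonical URL."""
        url = f"https://www.tiktok.com/@{username}/video/{VIDEO_ID}"
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        # Assumption: the page carries a <link rel="canonical"> tag;
        # otherwise fall back to the URL after redirects.
        m = re.search(r'<link rel="canonical" href="([^"]+)"', resp.text)
        return m.group(1) if m else resp.url

    # Both should come back as the same @nicoleutyro page: only the numeric
    # ID matters for resolution, the username segment is ignored.
    print(resolves_to("turkeycomedy"))
    print(resolves_to("nicoleutyro"))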
[13:12] Hey everyone, I have a few site requests for archiving. Can someone please throw these into ArchiveBot and ensure each page is archived appropriately?
[13:12] Japanese version of the Harry Potter 1 demo: https://www.4gamer.net/patch/demo/data/harrypotter.html
[13:12] Topic on the Oldunreal forums; contains some information and a link to the official patch (for the first game): https://www.oldunreal.com/cgi-bin/yabb2/YaBB.pl?num=1437028616
[13:12] HP1 ScriptSource, headers, precompiled binaries and HTK: https://www.oldunreal.com/cgi-bin/yabb2/YaBB.pl?num=1445541914 https://www.oldunreal.com/cgi-bin/yabb2/YaBB.pl?num=1490313063 http://coding.hanfling.de/launch/
[13:13] Maybe JAA if you have a free moment?
[13:17] There are outlinks to an exe demo for the JP link. And the Oldunreal forums also contain outlinks with directly hosted exe patches. Can you please make sure all of these files are captured?
[13:18] HP_Archiv: Launched. Most of those links on the oldunreal.com forums are also referenced directly on http://coding.hanfling.de/launch/ and so should be grabbed by that job. I only saw two extra links that don't appear there.
[13:20] JAA awesome, thank you ^^
[13:26] JAA, one other question: I'm trying to find a way to view videos from a YT channel that has since been deleted. The WBM crawled certain aspects of this channel: https://web.archive.org/web/20171206015000/https://www.youtube.com/user/Koops1997
[13:26] Is there any way for me to actually view the videos or even the video thumbnails?
[13:28] HP_Archiv: No idea. #youtubearchive doesn't have anything for that channel. Maybe someone else does, but it might also be lost.
[13:29] Hmm okay ^^
[14:24] this wget command has been running for five days now. maybe my --wait=2 --random-wait was a bigger delay than necessary.
[14:26] Some of the ArchiveBot jobs have been running for the better part of a year. If the site's reasonably large, that's normal.
[14:27] yup, I just have some tuning to do. There has to be a way to do a distributed crawl to make a WARC (I'm still reading about the ArchiveTeam tools -- that's what the warrior thing appears to be a part of?)
[14:28] Yeah, the warrior is part of our distributed archival system. That doesn't work for recursive crawls though. It requires being able to split the job up into individual work items that are largely independent of each other. Good examples are sites that use numeric identifiers; you then do a certain number range as an item.
[14:37] got it.
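To make the work-item idea concrete, here is a toy sketch (not warrior code) of cutting a numeric ID space into fixed-size ranges, each range being one item a worker can claim independently. All names are illustrative.

    # One work item = one contiguous slice of the ID space.
    from typing import Iterator, Tuple

    def id_range_items(first_id: int, last_id: int,
                       item_size: int = 1000) -> Iterator[Tuple[int, int]]:
        """Yield (start, end) pairs covering [first_id, last_id] inclusive."""
        start = first_id
        while start <= last_id:
            end = min(start + item_size - 1, last_id)
            yield (start, end)  # one item: fetch every ID in this range
            start = end + 1

    # Hand each item to a different warrior; items don't depend on each
    # other, which is what makes the approach distributable.
    for item in id_range_items(1, 10_500, item_size=2_500):
        print(item)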
[14:41] the types of stuff I'm interested in archiving (either myself or via archive.org or another service): podcasts, blogs, public "official" corporate-type websites that have lots of material, some GitHub projects, YouTube (and other) videos/channels, forums, subreddits, and maybe even some Facebook groups.
[14:42] I'd like archiving of forums and reddit posts to include the stuff that gets linked to as well. I feel like I'd have to set up a lot of the other pipelines for specific content types first.
[14:42] like if it's a youtube video, submit it to a youtube archiver; otherwise submit it to a regular web capture pipeline.
[14:43] My guess is that I could easily fill up 100 TiB of space in the first year of trying to do this, before adding podcasts and videos.
[14:51] I am feeling somewhat intimidated by the scope of what I'm hoping to accomplish. :/
[14:58] The way I handled distributed recursive crawling was to have each crawler check against a database to see if that particular site had already been crawled, or was in the queue to be crawled, and, if not, add it into the queue.
[14:59] Yeah, I worked on a distributed version of wpull a while ago.
[14:59] With the same idea.
[15:00] Central DB of URLs, check out a couple, retrieve them, report new ones back, rinse and repeat.
[15:00] But distributed recursive crawls can break easily due to cookies and whatnot.
[15:02] are either of those distributed crawl implementations on the archiveteam github?
[15:04] I'm reading the wpull GitHub readme and readthedocs now.
[15:04] Nope. Mine's on my wpull fork on a separate branch, but it's not usable anyway.
[15:05] to me, the disadvantage of wget is that my progress is tied to the lifetime of this one process. since wpull can write its URLs to disk, that suggests it may be able to resume if it exits for some reason (at least the architecture would allow for it).
[15:05] Yes, except cookies.
[15:06] seems like cookies could be solved pretty easily as well -- perhaps by storing them in a file (or a durable cache in the case of using more than one instance).
[15:07] Well, wpull does have the --save-cookies option to do that, but the file is only written if wpull exits cleanly.
[15:07] ah
[15:07] So it doesn't help on a power outage, system crash, etc.
[15:08] fortunately, I know Python, so I feel like I _might_ be able to rework the cookie storage a bit.
[15:08] They could be written to the database instead, but that also requires some significant rewrites/code rearrangements. It's been on my todo list of wpull dev work for a long time.
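A minimal sketch of that central-DB scheme, assuming SQLite purely for brevity; a shared queue would need a server database and real locking, the table layout and function names are made up, and link extraction is omitted.

    # Claim a URL, fetch it, report results and new URLs, repeat.
    import sqlite3
    import urllib.request
    from typing import List, Optional

    db = sqlite3.connect("crawl.db")
    db.execute("""CREATE TABLE IF NOT EXISTS urls (
        url    TEXT PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'todo'  -- todo / claimed / done
    )""")

    def claim_one() -> Optional[str]:
        """Check out one pending URL for this worker."""
        with db:  # transaction; fine for a single-process demo
            row = db.execute(
                "SELECT url FROM urls WHERE status = 'todo' LIMIT 1").fetchone()
            if row is None:
                return None
            db.execute("UPDATE urls SET status = 'claimed' WHERE url = ?", row)
            return row[0]

    def report(url: str, discovered: List[str]) -> None:
        """Mark a URL done and enqueue newly discovered URLs exactly once."""
        with db:
            db.execute("UPDATE urls SET status = 'done' WHERE url = ?", (url,))
            db.executemany("INSERT OR IGNORE INTO urls (url) VALUES (?)",
                           [(u,) for u in discovered])

    # Seed the queue, then rinse and repeat.
    with db:
        db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)",
                   ("http://example.com/",))
    while (url := claim_one()) is not None:
        body = urllib.request.urlopen(url).read()
        report(url, discovered=[])  # link extraction left out of the sketch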
[15:12] I was perusing the archive.org podcasts again, and it looks like there *is* someone who will grab stuff from iTunes and post it to archive.org: https://archive.org/details/podcast_secular-stories_1084744946?tab=about, done by arkiver2. I wasn't able to find any info about the software they used to grab the podcast and upload it to archive.org.
[15:14] oh, and the actual mp3 is listed as restricted?
[15:32] SketchCow: I'm uploading 3 issues of Computer Magazine by the IEEE Computer Society.
[15:32] 3 issues done, now 74 issues more to go.
[15:32] the JPEG compression is up to 95, vs. before it was at 90.
[15:58] https://archive.org/details/computer-magazine-1986-11
[15:58] https://archive.org/details/computer-magazine-1991-07
[15:58] https://archive.org/details/computer-magazine-1992-03
[16:15] raeyulca: nice! are you still around?
[16:58] where do I request humorous IRC channel names?
[16:58] Here? Or in -ot :D
[17:02] I've been thinking about using #clockwise for TikTok, but wondering if somebody has a better idea.
[17:02] i like clokwise (no 2nd c), heh
[17:03] or klokwise
[17:03] it's a good channel name imo
[17:03] might be a little bit unclear what it's for
[17:04] a lot of channel names seem to be related to a specific shutdown
[17:05] this is more of a proactive project (and somebody apparently already did all the work)
[17:05] right, but it's for a particular site
[17:06] we've definitely had long-running channels for sites that aren't dead... yet
[17:06] which is kind of the next question: who do I talk to so we're getting the data into Internet Archive pipelines?
[17:06] most of the people in here with an @ have some idea how to accomplish that
[17:21] #TikOff
[17:22] or #archiveteamTickedOff
[17:22] ^TickedOff
[18:07] former imho
[18:17] https://github.com/turicas/crau
[18:17] ^ Python + Scrapy based crawler, with WARC support
[18:19] That seems like it won't preserve transfer encoding.
[18:20] And apparently redirects might not get written either.
[18:20] Aka do not use for actual archival until it's more stable.
[18:20] thanks for the quick review JAA :)
[18:20] it'd be cool to have an "archiveteam WARC-creator tool conformance test"
[18:21] Coming soon.
[18:21] !!
[18:21] wumpus has been working on a general standards conformity test, and I started working on a transfer encoding comparison a while ago.
[18:22] !!!
[18:22] There are definitely other things that need to be tested as well though, e.g. whether headers are preserved correctly (no whitespace stripping, case normalisation, etc.).
[18:22] I've not heard of this before and you just made my day!
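For readers wondering what "preserving transfer encoding" actually means: below is a hedged sketch that fetches a page over a raw TLS socket, so any chunked framing stays visible in the bytes. example.org is a placeholder host, and whether a given server chunks its response is up to the server.

    # Fetch a page at the socket level so the wire bytes are untouched.
    import socket
    import ssl

    HOST = "example.org"
    request = (
        f"GET / HTTP/1.1\r\n"
        f"Host: {HOST}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode()

    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, 443)) as raw:
        with ctx.wrap_socket(raw, server_hostname=HOST) as s:
            s.sendall(request)
            data = b""
            while chunk := s.recv(65536):
                data += chunk

    # If the server answered with "Transfer-Encoding: chunked", the body
    # in `data` still contains hex chunk-size lines between the content
    # pieces. Most HTTP clients silently strip that framing, which is why
    # the bytes a tool writes to a WARC often differ from the wire bytes.
    print(data[:600].decode("latin-1"))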
[18:23] :-)
[18:23] Also, for checking digests, https://github.com/JustAnotherArchivist/little-things/blob/master/warc-tiny has a "verify" mode.
[18:24] (Spoiler: almost all tools write WARCs that don't conform to the standard in how the payload digests of chunked responses are calculated.)
[18:24] "Almost all" meaning that several major ones don't, and I've only seen one that does. That's what this TE comparison is about.
[18:25] I just need to find the time to actually launch that comparison project, reach out to all the tool authors, etc.
[18:25] And then there'll be a discussion about whether the tools or the standard need to be fixed.
[18:28] heart eyes emoji, thank you for all the excellent work you do
[18:29] JAA: is the "transfer encoding" bug filed? I see redirects might be https://github.com/turicas/crau/issues/1
[18:30] anarcat: Doesn't seem to be.
[18:30] JAA: and the redirect bug is something you confirmed? I would mention it (and you) in #1.
[18:30] I could open the other bug as well, mentioning you, if you won't.
[18:30] No, I haven't verified anything, just scanned through the code to see how stuff is being fetched and written to WARC.
[18:31] Which is also why I'm a bit hesitant to open the issue.
[18:31] I see.
[18:32] But I'm pretty sure, because this is how the record is being written: https://github.com/turicas/crau/blob/6833645067967471e530176fe667b50ebf7839f7/crau/utils.py#L184
[18:32] response.body will definitely not contain the TE.
[18:33] Or at least I highly doubt it, because no one using Scrapy normally would expect that.
[18:34] astrid: Thanks for the kind words. <3
[18:36] anarcat: I'll open an issue later.
[18:41] JAA: cool, thanks!
[18:47] Spotted another issue: headers are also not preserved exactly as sent by the server.
[18:47] dum dum dum!
[18:47] the plot thickens!
[18:48] There is at least a comment about that in the code: https://github.com/turicas/crau/blob/6833645067967471e530176fe667b50ebf7839f7/crau/utils.py#L169-L170
[18:48] Well, part of it, anyway.
[18:48] The fields might get mangled, normalised, or have their order changed as well, depending on which Python version you run it on, etc.
[18:49] (Unless Scrapy uses an OrderedDict for the headers)
[19:51] JAA: a header can be present multiple times, so you'd usually need a list anyway
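A sketch of the kind of check warc-tiny's "verify" mode performs (this is not warc-tiny's code): recompute a record's payload digest and compare it against the recorded WARC-Payload-Digest value, which is conventionally "sha1:" followed by base32. The chunked-response disagreement above is exactly the question of whether `payload` here should be the raw, still chunk-framed bytes or the transfer-decoded ones.

    # Recompute and compare a WARC payload digest.
    import base64
    import hashlib

    def payload_digest(payload: bytes) -> str:
        """Return the digest in the usual WARC notation: sha1:<base32>."""
        return "sha1:" + base64.b32encode(
            hashlib.sha1(payload).digest()).decode()

    def verify(payload: bytes, recorded: str) -> bool:
        """Compare a recomputed digest against the record's stated one."""
        return payload_digest(payload) == recorded

    # Example with a made-up payload:
    digest = payload_digest(b"hello world\n")
    print(digest, verify(b"hello world\n", digest))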
[20:06] is there a MediaWiki wizard who knows how to do conditions in templates?
[20:06] I'd like to extend it with support for multiple networks without breaking existing EFnet links.
[20:09] kpcyrd: As a Wikipedia editor, I might be able to help, I suppose?
[20:15] jc86035: the current template is https://www.archiveteam.org/index.php?title=Template:IRC&action=edit
[20:16] what I'm trying to do is either a suffix or a prefix that I test for
[20:16] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[20:16] (I don't have an account on the wiki)
[20:17] > To protect the wiki against automated account creation, we kindly ask you to answer the question that appears below (https://www.archiveteam.org/index.php?title=Special:Captcha/help):
[20:17] Visit the Archive Team IRC channel (#archiveteam on the EFnet network) and ask for the secret word. Ask for it by typing WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[20:18] I'm supposed to ask what your quest is, but I guess I already know.
[20:19] well, I was going to try to fix {{irc}} (as you know)
[20:20] kpcyrd: how will I receive the secret word?
[20:21] kpcyrd: You might need a list after parsing to continue working with the response, but the data written to the WARC should simply be the bytes sent by the server, with no processing whatsoever (minus TCP/IP layers etc., of course).
[20:21] And the data written to the WARC should most definitely not be constructed with an f-string using the parsed response data.
[20:22] kpcyrd: For the template, I'd start with something like {{#switch:{{{1|}}}|key=value|key2=value2|default}}; would you want the switch to be based on a second parameter or on the server name?
[20:23] The most obvious options are either (1) manually specifying the IRC host or (2) having a preset list of channel names that are non-EFnet.
[20:43] I have been noticed
[20:58] Maybe {{IRC|channel|hackint}} should also include "hackint" in the output, at least for as long as EFnet remains our primary network. Many people won't click on that link to join but will enter the channel name in their client, so they wouldn't see that it's a different network.
[20:58] E.g. "#channel (hackint)", with #channel being linked.
[20:59] anarcat: Issues filed.
[20:59] JAA: awesome
[21:00] I'm sure there are more issues in the code, but with these obvious ones, I'm not going to dig deeper now.
[21:01] JAA: I've added the code for that; it looks like it's working.
[21:02] Looks good, thanks!
[21:04] huge shoutout to jc86035! :)
[21:04] :)
[21:05] Also reminds me that I wanted to revamp Template:Infobox_project a long time ago.
[21:07] Maybe I'll take another stab at that later or tomorrow.
[21:16] I've added links to four of the undocumented APIs used in the Wayback Machine website. I'm not sure if anyone knows about them, but I've found them quite useful (especially for re-archiving URLs that were last archived years ago).
[21:17] The timemap API is by far the most useful one, I think. In the website implementation it's limited to 100,000 results, but if you take the request URL and increase the limit you can easily get millions of rows of JSON.
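A hedged sketch of the limit-raising trick just described. The timemap/json endpoint is undocumented; the parameter names below are modelled on the requests the Wayback Machine website itself sends and may change without notice, and example.com is a placeholder target.

    # Pull a large timemap listing by raising the limit parameter.
    import requests

    resp = requests.get(
        "https://web.archive.org/web/timemap/json",
        params={
            "url": "example.com",
            "matchType": "prefix",  # all captures under this URL prefix
            "output": "json",
            "fl": "original,timestamp",  # assumed field list
            "limit": "1000000",          # the site UI stops at 100000
        },
        timeout=300,
    )
    resp.raise_for_status()
    rows = resp.json()
    # First row names the fields; the rest are [original, timestamp] rows.
    print(rows[0], len(rows) - 1)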
[22:08] (continuing from yesterday…) how would one use a cron job or similar to feed ArchiveBot at regular intervals? could it go through IRC? could a third-party server just send web requests to it directly? or would it have to be set up internally?
[22:10] astrid: would these things work? or is there something else you/others had in mind for doing this?
[22:29] Er, that could be done, but I'd say it'd be better to run it somewhere else with grab-site or a more specialised tool (depending on the needs) and then upload the WARCs to IA.
[22:29] But yes, anything with AB would have to go through IRC.
[22:40] JAA: If you upload WARCs using the standard IA upload form, do they end up in the Wayback Machine, or are they just treated like other arbitrary files?
[22:42] jc86035: If you upload it correctly (with mediatype "web") *and* your account is whitelisted, it goes into the WBM.
[22:43] JAA: Makes sense. Do IA staff typically grant most requests to be whitelisted?
[22:43] Beyond a certain, quite small (in AT context) amount of data, you won't want to use the upload form anyway. 'ia upload' is king.
[22:44] I have no idea what the requirements are for getting whitelisted.
[22:44] I think they manually verify that the data is good and not tampered with, etc.
[22:57] JAA: I'm not sure where I'd start with it; would I just install grab-site on a server instance, collect data and ask to be whitelisted? would it be better to host such a thing on ArchiveTeam infrastructure? (a lot of my archival stuff runs on Wikimedia Toolforge even though it doesn't really directly benefit the Wikimedia sites, so it would probably be more appropriate to host those scripts somewhere else)
[22:58] Yeah, definitely not on Toolforge, since this is going to generate a lot more traffic than the /save/ requests you did before.
[22:59] "ArchiveTeam infrastructure" as such doesn't really exist; it's just random people's machines dedicated to some purpose. If you have a server and use it for AT, it becomes "AT infra". :-)
[22:59] It would definitely increase traffic (a lot of my older scripts didn't even archive images due to being xargs-based), but for a lot of them I'd probably limit it to one or two recursions.
[22:59] For the same reason, we also don't have idle resources lying around, really. Machines exist to be used around here.
[23:00] I don't actually have any spare servers lying around; right now it's just my laptop (and I probably wouldn't want to run grab-site on it).
[23:00] So basically yeah, either get a server somewhere or find someone willing to run it on a server of theirs (or for you).
[23:01] Sigh, jrwr, your bot still breaks when warriorhq isn't reachable. :-/
[23:03] I guess I could just keep using Toolforge to make lots of requests to the old Save Page Now, if I don't figure anything else out. Right now it's unnecessarily ad hoc; some of it still goes through via.hypothes.is because of an issue I had with the IA servers.
[23:04] I haven't done any serious maintenance for several months though.
[23:22] I'm interested in setting up a scheduled cron-like grab; it's just that we have fire-drill projects at the moment.
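Since 'ia upload' came up above: that CLI is backed by the internetarchive Python library, so the same upload can be scripted. A minimal sketch, in which the item identifier, filename, and extra metadata are made up; as noted in the log, mediatype "web" only feeds the Wayback Machine for whitelisted accounts.

    # Upload a WARC to an archive.org item via the internetarchive library.
    from internetarchive import upload

    results = upload(
        "my-site-grab-20191108",             # hypothetical item identifier
        files=["my-site-20191108.warc.gz"],  # WARC from e.g. grab-site
        metadata={
            "mediatype": "web",
            "title": "Grab of my-site, 2019-11-08",
        },
    )
    for r in results:
        print(r.status_code, r.request.url)

The CLI equivalent would be along the lines of 'ia upload my-site-grab-20191108 my-site-20191108.warc.gz --metadata="mediatype:web"'.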