#archiveteam-bs 2019-11-13,Wed


Time Nickname Message
00:02 πŸ”— icedice has joined #archiveteam-bs
00:02 πŸ”— robogoat has quit IRC (Read error: Operation timed out)
00:11 πŸ”— manjaro-u has quit IRC (Ping timeout: 258 seconds)
00:18 πŸ”— BartoCH has joined #archiveteam-bs
00:51 πŸ”— BlueMax has quit IRC (Quit: Leaving)
01:18 πŸ”— Video has quit IRC (Read error: Operation timed out)
01:25 πŸ”— sec^nd has quit IRC (Ping timeout: 745 seconds)
01:25 πŸ”— odemgi has joined #archiveteam-bs
01:28 πŸ”— odemgi_ has quit IRC (Read error: Connection reset by peer)
01:35 πŸ”— RichardG_ has quit IRC (Keyboard not found, press F1 to continue)
01:39 πŸ”— RichardG has joined #archiveteam-bs
02:24 πŸ”— britmob has joined #archiveteam-bs
02:24 πŸ”— BlueMax has joined #archiveteam-bs
03:11 πŸ”— bluefoo has quit IRC (Read error: Connection reset by peer)
03:32 πŸ”— asdf0101 has quit IRC (The Lounge - https://thelounge.chat)
03:32 πŸ”— markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
03:40 πŸ”— asdf0101 has joined #archiveteam-bs
03:40 πŸ”— markedL has joined #archiveteam-bs
04:00 πŸ”— markedH has quit IRC (Read error: Operation timed out)
04:29 πŸ”— odemgi_ has joined #archiveteam-bs
04:32 πŸ”— odemgi has quit IRC (Read error: Operation timed out)
04:34 πŸ”— qw3rty2 has joined #archiveteam-bs
04:41 πŸ”— qw3rty has quit IRC (Ping timeout: 745 seconds)
05:05 πŸ”— second has joined #archiveteam-bs
05:27 πŸ”— kiska18 has quit IRC (Remote host closed the connection)
05:27 πŸ”— Ryz has quit IRC (Remote host closed the connection)
05:27 πŸ”— kiska18 has joined #archiveteam-bs
05:27 πŸ”— Fusl__ sets mode: +o kiska18
05:27 πŸ”— Fusl sets mode: +o kiska18
05:27 πŸ”— Fusl_ sets mode: +o kiska18
05:27 πŸ”— Ryz has joined #archiveteam-bs
05:33 πŸ”— eythian has quit IRC (Remote host closed the connection)
05:45 πŸ”— raeyulca has joined #archiveteam-bs
05:47 πŸ”— eythian has joined #archiveteam-bs
05:48 πŸ”— raeyulca Hello, everyone. Someone told me about you guys - I have been archiving TikTok and heard that you may be interested.
06:09 πŸ”— markedH has joined #archiveteam-bs
06:12 πŸ”— markedL Someone was talking about that site the other day. kpcyrd ^
06:16 πŸ”— kiskabak has quit IRC (Read error: Operation timed out)
06:17 πŸ”— jake_test has quit IRC (Read error: Operation timed out)
06:19 πŸ”— Panasonic has joined #archiveteam-bs
06:19 πŸ”— fredgido has joined #archiveteam-bs
06:19 πŸ”— markedH has quit IRC (Read error: Operation timed out)
06:19 πŸ”— jake_test has joined #archiveteam-bs
06:19 πŸ”— odemgi has joined #archiveteam-bs
06:20 πŸ”— Damme has joined #archiveteam-bs
06:20 πŸ”— odemgi has quit IRC (Read error: Connection reset by peer)
06:20 πŸ”— odemgi has joined #archiveteam-bs
06:20 πŸ”— markedL So what's the storage consumed by 23.5M videos?
06:20 πŸ”— systwi_ has joined #archiveteam-bs
06:21 πŸ”— Raccoon has quit IRC (Read error: Connection reset by peer)
06:21 πŸ”— benjinsmi has quit IRC (Read error: Operation timed out)
06:22 πŸ”— Raccoon has joined #archiveteam-bs
06:23 πŸ”— Damme_ has quit IRC (Read error: Operation timed out)
06:23 πŸ”— benjins has joined #archiveteam-bs
06:23 πŸ”— fredgido_ has quit IRC (Read error: Operation timed out)
06:23 πŸ”— Ravenloft has quit IRC (Read error: Operation timed out)
06:24 πŸ”— markedH has joined #archiveteam-bs
06:24 πŸ”— Mayonaise has quit IRC (Read error: Operation timed out)
06:26 πŸ”— raeyulca ~55TB
06:27 πŸ”— Mayonaise has joined #archiveteam-bs
06:27 πŸ”— odemgi_ has quit IRC (Ping timeout: 612 seconds)
06:28 πŸ”— raeyulca wait nvm, 64TB
06:29 πŸ”— systwi has quit IRC (Ping timeout: 612 seconds)
06:30 πŸ”— jake_test has quit IRC (Ping timeout: 492 seconds)
06:31 πŸ”— jake_test has joined #archiveteam-bs
06:45 πŸ”— SketchCow You have 64tb of videos?
06:49 πŸ”— raeyulca yes
06:55 πŸ”— luckcolor has quit IRC (Read error: Operation timed out)
06:58 πŸ”— luckcolor has joined #archiveteam-bs
07:03 πŸ”— chfoo_ is now known as chfoo
07:10 πŸ”— phillipsj has quit IRC (Ping timeout: 252 seconds)
07:16 πŸ”— phillipsj has joined #archiveteam-bs
07:16 πŸ”— robogoat has joined #archiveteam-bs
07:22 πŸ”— markedL I hadn't looked into the site until literally just now. Looks like the IDs are really large numbers. How did you do discovery? If you can stick around for a few hours, others with more informed questions will start strolling in for the day
07:25 πŸ”— raeyulca I'm PST so I'll be going to bed very shortly.
07:25 πŸ”— raeyulca >How did you do discovery?
07:25 πŸ”— raeyulca I reverse-engineered the web API (the mobile API proved to be mostly too difficult)
07:27 πŸ”— bluefoo has joined #archiveteam-bs
07:28 πŸ”— raeyulca I have not found any correlation between the video IDs and anything I thought might be correlated (user ID, time of upload, etc). That could be solved with ML but I didn't think it was worth it, when the API just gives you valid video IDs
07:28 πŸ”— raeyulca Something to note about video ID urls: here's an example
07:29 πŸ”— raeyulca https://www.tiktok.com/@turkeycomedy/video/6758661713001745670?langCountry=en
07:30 πŸ”— raeyulca This is not valid. The video "6758661713001745670" does not belong to @turkeycomedy, but actually belongs to @nicoleutyro
07:30 πŸ”— raeyulca but the URL will resolve anyways if the ID is valid
07:36 πŸ”— markedL yeah, US Pacific forms the last hold-outs. What percentage do you estimate 23.5 million covers?
07:38 πŸ”— markedL I've heard of people decompiling other mobile apps to get API keys, but if there's a web site, I agree it's not as worth it.
07:39 πŸ”— raeyulca I don't know the answer to that question. It's 146k individual users. I (ab)use the "recommended" feature, which I'm assuming uses some sort of hand-wavy algorithm to give "similar" users, and I maybe only get 10 or so new users a day
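The discovery strategy raeyulca describes — repeatedly pulling "recommended" users and queueing any not seen before — amounts to a breadth-first walk over the suggestion graph. A minimal sketch, where `get_recommended` is a hypothetical stand-in for the reverse-engineered API call:

```python
from collections import deque

def discover_users(seeds, get_recommended, limit=1000):
    """Breadth-first walk over a 'recommended users' graph.

    get_recommended(user) -> iterable of suggested usernames
    (a hypothetical stand-in for the real endpoint).
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(seen) < limit:
        user = queue.popleft()
        for suggested in get_recommended(user):
            if suggested not in seen:
                seen.add(suggested)
                queue.append(suggested)
    return seen
```

The low daily yield raeyulca reports is what you would expect once such a walk has exhausted its connected component.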
07:40 πŸ”— raeyulca Decompiling the app to get an API key is not possible in this case
07:40 πŸ”— raeyulca It uses a time-based key ("X-Gorgon") in the header to verify all API requests. It does some bitwise manipulation on the current unix timestamp in ms and some startup params, but I was never able to get farther than that.
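For illustration only, a token in the *style* raeyulca describes — bitwise mixing of the millisecond unix timestamp with startup parameters. The real X-Gorgon algorithm was never recovered, so both the function and its mixing steps here are invented:

```python
import time

def gorgon_like_token(params, now_ms=None):
    """Illustrative timestamp-derived token, NOT the real X-Gorgon scheme.

    Mixes the ms-resolution unix timestamp with arbitrary startup
    params via shifts and XORs, the kind of construction described.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    acc = now_ms & 0xFFFFFFFF
    for b in params:
        acc = ((acc << 5) ^ (acc >> 3) ^ b) & 0xFFFFFFFF
    return f"{now_ms:x}{acc:08x}"
```

Because such tokens are derived from the request time, replaying a captured header fails once the timestamp falls outside the server's acceptance window — which is what makes the mobile API hard to drive without the full algorithm.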
07:43 πŸ”— m007a83_ has joined #archiveteam-bs
07:43 πŸ”— m007a83_ has quit IRC (Connection closed)
07:45 πŸ”— markedL fancy. this sounds interesting. if we get more interest we'll form a dedicated channel. there's a wiki page but just started, https://www.archiveteam.org/index.php?title=TikTok
07:47 πŸ”— m007a83 has quit IRC (Ping timeout: 252 seconds)
07:48 πŸ”— m007a83 has joined #archiveteam-bs
07:59 πŸ”— m007a83 has quit IRC (Quit: Fuck you Comcast)
09:38 πŸ”— omglolbah has quit IRC (Quit: ZNC - https://znc.in)
09:49 πŸ”— Flashfire has quit IRC (Remote host closed the connection)
09:49 πŸ”— kiska has quit IRC (Remote host closed the connection)
09:50 πŸ”— kiska has joined #archiveteam-bs
09:50 πŸ”— Fusl__ sets mode: +o kiska
09:50 πŸ”— Fusl sets mode: +o kiska
09:50 πŸ”— Fusl_ sets mode: +o kiska
09:50 πŸ”— Flashfire has joined #archiveteam-bs
10:26 πŸ”— omglolbah has joined #archiveteam-bs
11:23 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
11:46 πŸ”— odemg has joined #archiveteam-bs
12:36 πŸ”— markedH has quit IRC (Leaving)
13:10 πŸ”— HP_Archiv has joined #archiveteam-bs
13:12 πŸ”— HP_Archiv Hey everyone, I have a few site requests for archiving. Can someone please throw this into archivebot and ensure each page is archived appropriately?
13:12 πŸ”— HP_Archiv Japanese version of Harry Potter 1 Demo https://www.4gamer.net/patch/demo/data/harrypotter.html
13:12 πŸ”— HP_Archiv Topic on Oldunreal Forums. Contains some information and link to official patch (for first game). https://www.oldunreal.com/cgi-bin/yabb2/YaBB.pl?num=1437028616
13:12 πŸ”— HP_Archiv HP1 ScriptSource, headers, precompiled binaries and HTK. https://www.oldunreal.com/cgi-bin/yabb2/YaBB.pl?num=1445541914 https://www.oldunreal.com/cgi-bin/yabb2/YaBB.pl?num=1490313063 http://coding.hanfling.de/launch/
13:13 πŸ”— HP_Archiv Maybe JAA if you have a free moment?
13:17 πŸ”— HP_Archiv There are outlinks to an exe demo for the JP link. And the Oldunreal forums also contain outlinks with direct hosted exe patches. Can you please make sure all of these files are captured?
13:18 πŸ”— JAA HP_Archiv: Launched. Most of those links on the oldunreal.com forums are also referenced directly on http://coding.hanfling.de/launch/ and so should be grabbed by that job. I only saw two extra links that don't appear there.
13:20 πŸ”— HP_Archiv JAA awesome, thank you ^^
13:26 πŸ”— HP_Archiv JAA one other question, I'm trying to find a way to view videos from a YT channel that has since been deleted. WBM crawled certain aspects of this channel, https://web.archive.org/web/20171206015000/https://www.youtube.com/user/Koops1997
13:26 πŸ”— HP_Archiv Is there any way for me to actually view the videos or even video thumbnails?
13:28 πŸ”— JAA HP_Archiv: No idea. #youtubearchive doesn't have anything for that channel. Maybe someone else does, but it might also be lost.
13:29 πŸ”— HP_Archiv Hmm okay ^^
13:49 πŸ”— jc86035 has joined #archiveteam-bs
14:05 πŸ”— deevious has quit IRC (Quit: deevious)
14:24 πŸ”— prq this wget command has been running for five days now. maybe my --wait=2 --random-wait was a bigger delay than necessary.
14:26 πŸ”— JAA Some of the ArchiveBot jobs have been running for the better part of a year. If the site's reasonably large, that's normal.
14:27 πŸ”— prq yup, I just have some tuning to do. There has to be a way to do a distributed crawl to make a warc (I'm still reading about the archiveteam tools-- that's what the warrior thing appears to be a part of?)
14:28 πŸ”— JAA Yeah, the warrior is part of our distributed archival system. That doesn't work for recursive crawls though. It requires being able to split the job up into individual work items that are largely independent of each other. Good examples are sites that use numeric identifiers, where a certain number range becomes an item.
14:37 πŸ”— prq got it.
14:41 πŸ”— prq the types of stuff I'm interested in archiving (either myself or via archive.org or another service): podcasts, blogs, public "official" corporate type websites that have lots of material, some github projects, youtube(and other) videos/channels, forums, subreddits, and maybe even some facebook groups.
14:42 πŸ”— prq I'd like for archiving forums and reddit posts to include stuff that gets linked to as well. I feel like I'd have to set up a lot of the other pipelines for specific content types first
14:42 πŸ”— prq like if it's a youtube video, submit it to a youtube archiver, otherwise submit it to a regular web capture pipeline.
14:43 πŸ”— prq My guess is that I could easily fill up 100TiB of space in the first year of trying to do this, before adding podcasts and videos.
14:47 πŸ”— omglolbah has quit IRC (Read error: No route to host)
14:49 πŸ”— omglolbah has joined #archiveteam-bs
14:51 πŸ”— prq I am feeling somewhat intimidated by the scope of what I'm hoping to accomplish. :/
14:51 πŸ”— jc86035 has quit IRC (Quit: Leaving.)
14:58 πŸ”— raeyulca The way I handled distributed recursive crawling was to have each crawler check against a database to see if that particular site had already been crawled, or was in the queue to be crawled, and, if not, add it to the queue
14:59 πŸ”— JAA Yeah, I worked on a distributed version of wpull a while ago.
14:59 πŸ”— JAA With the same idea.
15:00 πŸ”— JAA Central DB of URLs, check out a couple, retrieve them, report new ones back, rinse and repeat.
15:00 πŸ”— JAA But distributed recursive crawls can break easily due to cookies and whatnot.
15:02 πŸ”— prq are either of those distributed crawl implementations on the archiveteam github?
15:04 πŸ”— prq I'm reading the wpull github readme and readthedocs now.
15:04 πŸ”— JAA Nope. Mine's on my wpull fork on a separate branch, but it's not usable anyway.
15:05 πŸ”— prq to me, the disadvantage of wget is that my progress is tightly tied to the lifetime of this one process. Since wpull can write its URLs to disk, that suggests it may be able to resume if it exits for some reason (at least the architecture would allow for it)
15:05 πŸ”— JAA Yes, except cookies.
15:06 πŸ”— prq seems like cookies could be solved pretty easily as well-- perhaps by storing those in a file (or a durable cache in the case of using more than one instance)
15:07 πŸ”— JAA Well, wpull does have the --save-cookies option to do that, but it's only written if wpull exits cleanly.
15:07 πŸ”— prq ah
15:07 πŸ”— JAA So it doesn't help on a power outage, system crash, etc.
15:08 πŸ”— prq fortunately, I know python, so I feel like I _might_ be able to rework the cookie storage a bit.
15:08 πŸ”— JAA They could be written to the database instead, but that also requires some significant rewrites/code rearrangements. It's been on my todo list of wpull dev work for a long time.
15:12 πŸ”— akierig has joined #archiveteam-bs
15:12 πŸ”— prq I was perusing the archive.org podcasts again, and it looks like there *is* someone who will grab stuff from itunes and post it to archive.org: https://archive.org/details/podcast_secular-stories_1084744946?tab=about done by arkiver2. I wasn't able to find any info about the software they used to grab the podcast and upload it to archive.org
15:13 πŸ”— purplebot has quit IRC (Quit: KeyboardInterrupt)
15:14 πŸ”— prq oh, and the actual mp3 is listed as restricted?
15:14 πŸ”— purplebot has joined #archiveteam-bs
15:25 πŸ”— purplebot has quit IRC (Quit: KeyboardInterrupt)
15:25 πŸ”— purplebot has joined #archiveteam-bs
15:30 πŸ”— odemg has quit IRC (Ping timeout: 745 seconds)
15:31 πŸ”— odemg has joined #archiveteam-bs
15:32 πŸ”— godane SketchCow: i'm uploading 3 issues of Computer Magazine by IEEE Computer Society
15:32 πŸ”— godane 3 issues done now, 74 more to go
15:32 πŸ”— godane the JPEG quality is up to 95 vs. 90 before
15:52 πŸ”— jc86035 has joined #archiveteam-bs
15:58 πŸ”— godane https://archive.org/details/computer-magazine-1986-11
15:58 πŸ”— godane https://archive.org/details/computer-magazine-1991-07
15:58 πŸ”— godane https://archive.org/details/computer-magazine-1992-03
16:15 πŸ”— kpcyrd raeyulca: nice! are you still around?
16:26 πŸ”— HP_Archiv has quit IRC (Ping timeout: 260 seconds)
16:27 πŸ”— HP_Archiv has joined #archiveteam-bs
16:33 πŸ”— icedice has quit IRC (Quit: Leaving)
16:58 πŸ”— kpcyrd where do I request humorous irc channel names?
16:58 πŸ”— kiska Here? Or in -ot :D
17:02 πŸ”— kpcyrd I've been thinking about using #clockwise for tiktok, but wondering if somebody has a better idea
17:02 πŸ”— astrid i like clokwise (no 2nd c), heh
17:03 πŸ”— astrid or klokwise
17:03 πŸ”— astrid its a good channel name imo
17:03 πŸ”— astrid might be a little bit unclear what it's for
17:04 πŸ”— kpcyrd a lot of channel names seem to be related to a specific shutdown
17:05 πŸ”— kpcyrd this is more of a pro-active project (and somebody apparently already did all the work)
17:05 πŸ”— astrid right but it's for a particular site
17:06 πŸ”— astrid we've definitely had long-running channels for sites that aren't dead ... yet
17:06 πŸ”— kpcyrd which is kind of the next question, who do I talk to so we're getting the data into internet archive pipelines?
17:06 πŸ”— astrid most of the people in here with an @ have some idea how to accomplish that
17:10 πŸ”— Jens has quit IRC (Remote host closed the connection)
17:11 πŸ”— Jens has joined #archiveteam-bs
17:15 πŸ”— systwi_ has quit IRC (Read error: Connection reset by peer)
17:15 πŸ”— phillipsj has quit IRC (Remote host closed the connection)
17:16 πŸ”— HashbangI has quit IRC (Read error: Operation timed out)
17:16 πŸ”— jake_test has quit IRC (Read error: Operation timed out)
17:17 πŸ”— wyatt8740 has quit IRC (Ceci n'est pas un IRC quit message.)
17:17 πŸ”— wyatt8740 has joined #archiveteam-bs
17:19 πŸ”— phillipsj has joined #archiveteam-bs
17:19 πŸ”— PhrackD has quit IRC (Remote host closed the connection)
17:21 πŸ”— markedL #TikOff
17:22 πŸ”— markedL or #TickedOff
17:22 πŸ”— kode54 has quit IRC (Remote host closed the connection)
17:55 πŸ”— akierig has quit IRC (Read error: Operation timed out)
17:59 πŸ”— akierig has joined #archiveteam-bs
18:03 πŸ”— jc86035 has quit IRC (Quit: Leaving.)
18:07 πŸ”— Raccoon former imho
18:17 πŸ”— anarcat https://github.com/turicas/crau
18:17 πŸ”— anarcat ^python + scrapy based crawler, with WARC support
18:19 πŸ”— JAA That seems like it won't preserve transfer encoding.
18:20 πŸ”— JAA And apparently redirects might not get written either.
18:20 πŸ”— JAA Aka do not use for actual archival until it's more stable.
18:20 πŸ”— anarcat thanks for the quick review JAA :)
18:20 πŸ”— astrid it'd be cool to have an "archiveteam warc-creator tool conformance test"
18:21 πŸ”— JAA Coming soon.
18:21 πŸ”— astrid !!
18:21 πŸ”— JAA wumpus has been working on a general standards conformity test, and I've started working on a transfer encoding comparison a bit ago.
18:22 πŸ”— astrid !!!
18:22 πŸ”— JAA There are definitely other things that need to be tested as well though, e.g. whether headers are preserved correctly (no whitespace stripping, case normalisation, etc.).
18:22 πŸ”— astrid ive not heard of this before and you just made my day!
18:23 πŸ”— JAA :-)
18:23 πŸ”— JAA Also, for checking digests, https://github.com/JustAnotherArchivist/little-things/blob/master/warc-tiny has a "verify" mode.
18:24 πŸ”— JAA (Spoiler: almost all tools write WARCs that don't conform to the standard in how the payload digests of chunked responses are calculated.)
18:24 πŸ”— JAA "Almost all" meaning that several major ones don't, and I've only seen one that does. That's what this TE comparison is about.
18:25 πŸ”— JAA I just need to find the time to actually launch that comparison project, reach out to all the tool authors, etc.
18:25 πŸ”— JAA And then there'll be a discussion about whether the tools or the standard need to be fixed.
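The digest discrepancy JAA alludes to can be made concrete by hashing a chunked response both ways — over the transfer-encoded bytes as sent versus over the de-chunked payload. The decoder below is a deliberately minimal sketch (no trailers, no chunk extensions beyond stripping them):

```python
import hashlib

def dechunk(raw):
    """Decode an HTTP/1.1 chunked message body (minimal, no trailers)."""
    body, rest = b"", raw
    while True:
        size_line, _, rest = rest.partition(b"\r\n")
        size = int(size_line.split(b";")[0], 16)  # drop chunk extensions
        if size == 0:
            return body
        body += rest[:size]
        rest = rest[size + 2:]  # skip chunk data plus trailing CRLF

raw = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
digest_raw = hashlib.sha1(raw).hexdigest()               # bytes as sent
digest_decoded = hashlib.sha1(dechunk(raw)).hexdigest()  # decoded payload
```

The two digests differ, so tools that store the raw bytes but digest the decoded payload (or vice versa) will disagree with each other — the ambiguity the comparison project is meant to settle.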
18:28 πŸ”— astrid heart eyes emoji, thank you for all the excellent work you do
18:29 πŸ”— anarcat JAA: is the "transfer encoding" bug filed? i see redirects might be https://github.com/turicas/crau/issues/1
18:30 πŸ”— JAA anarcat: Doesn't seem to be.
18:30 πŸ”— anarcat JAA: and the redirect bug is something you confirmed? i would mention it (and you) in #1
18:30 πŸ”— anarcat i could open the other bug as well, mentioning you, if you wouldn't
18:30 πŸ”— JAA No, I haven't verified anything, just scanned through the code to see how stuff is being fetched and written to WARC.
18:31 πŸ”— JAA Which is also why I'm a bit hesitant to open the issue.
18:31 πŸ”— anarcat i see
18:32 πŸ”— JAA But I'm pretty sure because this is how the record is being written: https://github.com/turicas/crau/blob/6833645067967471e530176fe667b50ebf7839f7/crau/utils.py#L184
18:32 πŸ”— JAA response.body will definitely not contain TE.
18:33 πŸ”— JAA Or at least I highly doubt it, because no one using Scrapy normally would expect that.
18:34 πŸ”— JAA astrid: Thanks for the kind words. <3
18:36 πŸ”— JAA anarcat: I'll open an issue later.
18:37 πŸ”— akierig_ has joined #archiveteam-bs
18:38 πŸ”— akierig__ has joined #archiveteam-bs
18:41 πŸ”— anarcat JAA: cool thanks!
18:44 πŸ”— akierig has quit IRC (Read error: Operation timed out)
18:44 πŸ”— akierig_ has quit IRC (Read error: Operation timed out)
18:47 πŸ”— PaulW has joined #archiveteam-bs
18:47 πŸ”— JAA Spotted another issue: headers are also not preserved exactly as sent by the server.
18:47 πŸ”— anarcat dum dum dum!
18:47 πŸ”— anarcat the plot thickens!
18:48 πŸ”— JAA There is at least a comment about that in the code: https://github.com/turicas/crau/blob/6833645067967471e530176fe667b50ebf7839f7/crau/utils.py#L169-L170
18:48 πŸ”— JAA Well, part of it, anyway.
18:48 πŸ”— JAA The fields might get mangled, normalised, or order changed as well, depending on which Python version you run it on etc.
18:49 πŸ”— JAA (Unless Scrapy uses an OrderedDict for the headers)
18:51 πŸ”— PaulW has quit IRC (Ping timeout: 260 seconds)
18:51 πŸ”— akierig__ has quit IRC (Quit: later_gator)
19:15 πŸ”— akierig has joined #archiveteam-bs
19:48 πŸ”— jc86035 has joined #archiveteam-bs
19:51 πŸ”— kpcyrd JAA: a header can be present multiple times, so you'd usually need a list anyway
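kpcyrd's point in concrete form: once headers are parsed into a plain dict, repeated fields collapse and the original bytes are gone, which is why a faithful WARC writer keeps the raw (name, value) pairs rather than a parsed mapping. A small illustration (made-up header values):

```python
# Headers roughly as a server might send them, kept as raw byte pairs.
raw_headers = [
    (b"Set-Cookie", b"a=1"),
    (b"Set-Cookie", b"b=2"),       # repeated field: both must survive
    (b"X-Custom", b"  padded  "),  # whitespace preserved verbatim
]

# A dict-based representation silently drops one Set-Cookie value.
as_dict = dict(raw_headers)
```

This is independent of JAA's stricter point that the WARC record should contain the bytes exactly as received, not a reconstruction from any parsed form.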
20:06 πŸ”— kpcyrd is there an mediawiki wizard who knows how to do conditions in templates?
20:06 πŸ”— kpcyrd I'd like to extend it with support for multiple networks without breaking existing efnet links
20:09 πŸ”— jc86035 kpcyrd: As a Wikipedia editor, I might be able to help, I suppose?
20:15 πŸ”— kpcyrd jc86035: the current template is https://www.archiveteam.org/index.php?title=Template:IRC&action=edit
20:16 πŸ”— kpcyrd what I'm trying to do is either a suffix or a prefix that I test for
20:16 πŸ”— jc86035 WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
20:16 πŸ”— jc86035 (I don't have an account on the wiki)
20:17 πŸ”— jc86035 > To protect the wiki against automated account creation, we kindly ask you to answer the question that appears below (https://www.archiveteam.org/index.php?title=Special:Captcha/help):
20:17 πŸ”— jc86035 Visit the Archive Team IRC channel (#archiveteam on the EFnet network) and ask for the secret word. Ask for it by typing WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
20:18 πŸ”— kpcyrd I'm supposed to ask what your quest is, but I guess I already know
20:19 πŸ”— jc86035 well, I was going to try to fix {{irc}} (as you know)
20:20 πŸ”— jc86035 kpcyrd: how will I receive the secret word?
20:21 πŸ”— JAA kpcyrd: You might need a list after parsing for continuing to work with the response, but the data written to the WARC should simply be the bytes sent by the server with no processing whatsoever (minus TCP/IP layers etc. of course).
20:21 πŸ”— JAA And the data written to the WARC should most definitely not be constructed with an f-string using the parsed response data.
20:22 πŸ”— jc86035 kpcyrd: For the template I'd start with something like {{#switch:{{{1|}}}|key=value|key2=value2|default}}; would you want the switch to be based on a second parameter or on the server name?
20:23 πŸ”— jc86035 The most obvious options are either (1) manually specifying the IRC host or (2) having a preset list of channel names that are non-EFNet
20:29 πŸ”— tech234a has joined #archiveteam-bs
20:29 πŸ”— X-Scale` has joined #archiveteam-bs
20:38 πŸ”— X-Scale has quit IRC (Read error: Operation timed out)
20:38 πŸ”— X-Scale` is now known as X-Scale
20:39 πŸ”— TC01_ has quit IRC (Ping timeout: 252 seconds)
20:40 πŸ”— TC01 has joined #archiveteam-bs
20:43 πŸ”— jc86035 I have been noticed
20:58 πŸ”— JAA Maybe {{IRC|channel|hackint}} should also include "hackint" in the output, at least for as long as EFnet remains our primary network. Many people won't click on that link to join but enter the channel name in their client, so they wouldn't see that it's a different network.
20:58 πŸ”— JAA E.g. "#channel (hackint)" with #channel being linked.
20:59 πŸ”— JAA anarcat: Issues filed.
20:59 πŸ”— anarcat JAA: awesome
21:00 πŸ”— JAA I'm sure there are more issues in the code, but with these obvious ones, I'm not going to dig deeper now.
21:01 πŸ”— jc86035 JAA: I've added the code for that; it looks like it's working
21:02 πŸ”— JAA Looks good, thanks!
21:04 πŸ”— kpcyrd huge shoutout to jc86035! :)
21:04 πŸ”— jc86035 :)
21:05 πŸ”— JAA Also reminds me that I wanted to revamp Template:Infobox_project a long time ago.
21:07 πŸ”— JAA Maybe I'll take another stab at that later or tomorrow.
21:16 πŸ”— jc86035 I've added links to four of the undocumented APIs used in the Wayback Machine website. I'm not sure if anyone knows about them, but I've found them quite useful (especially for re-archiving URLs that were last archived years ago).
21:17 πŸ”— jc86035 The timemap API is by far the most useful one, I think. In the website implementation it's limited to 100,000 results, but if you take the request URL and increase the limit you can easily get it to millions of rows of JSON.
22:03 πŸ”— Mateon1 has quit IRC (Read error: Connection reset by peer)
22:03 πŸ”— Mateon1 has joined #archiveteam-bs
22:08 πŸ”— jc86035 (continuing from yesterday…) how would one use a cron job or similar to feed archivebot at regular intervals? could it go through IRC? could a third-party server just send web requests to it directly? or would it have to be set up internally?
22:10 πŸ”— jc86035 astrid: would these things work? or is there something else you/others had in mind for doing this?
22:10 πŸ”— akierig has quit IRC (Quit: later_gator)
22:29 πŸ”— JAA Er, that could be done, but I'd say it'd be better to run it somewhere else with grab-site or a more specialised tool (depending on the needs) and then upload the WARCs to IA.
22:29 πŸ”— JAA But yes, anything with AB would have to go through IRC.
22:34 πŸ”— X-Scale` has joined #archiveteam-bs
22:37 πŸ”— X-Scale has quit IRC (Read error: Operation timed out)
22:37 πŸ”— X-Scale` is now known as X-Scale
22:40 πŸ”— jc86035 JAA: If you upload WARCs using the standard IA upload form, do they end up in the Wayback Machine or are they just treated like other arbitrary files?
22:42 πŸ”— JAA jc86035: If you upload it correctly (with mediatype "web") *and* your account is whitelisted, it goes into the WBM.
22:43 πŸ”— jc86035 JAA: Makes sense, do IA staff typically grant most requests to be whitelisted?
22:43 πŸ”— JAA Beyond a certain, quite small (in AT context) amount of data, you won't want to use the upload form anyway. 'ia upload' is king.
22:44 πŸ”— JAA I have no idea what the requirements are for getting whitelisted.
22:44 πŸ”— JAA I think they manually verify that the data is good and not tampered with etc.
22:57 πŸ”— jc86035 JAA: I'm not sure where I'd start with it; would I just install grab-site on a server instance, collect data and ask to be whitelisted? would it be better to host such a thing on archiveteam infrastructure? (a lot of my archival stuff runs on Wikimedia Toolforge even though it doesn't really directly benefit the Wikimedia sites, so it would probably be more appropriate to host those scripts somewhere else)
22:58 πŸ”— JAA Yeah, definitely not on Toolforge since this is going to generate a lot more traffic than the /save/ requests you did before.
22:59 πŸ”— JAA "ArchiveTeam infrastructure" as such doesn't really exist; it's just random people's machines dedicated to some purpose. If you have a server and use it for AT, it becomes "AT infra". :-)
22:59 πŸ”— jc86035 It would definitely increase traffic (a lot of my older scripts didn't even archive images due to being xargs-based) but for a lot of them I'd probably limit it to one or two recursions
22:59 πŸ”— JAA For the same reason, we also don't have idle resources lying around really. Machines exist to be used around here.
23:00 πŸ”— jc86035 I don't actually have any spare servers lying around, right now it's just my laptop (and I probably wouldn't want to run grab-site on it)
23:00 πŸ”— JAA So basically yeah, either get a server somewhere or find someone willing to run it on a server of theirs (or for you).
23:01 πŸ”— JAA Sigh, jrwr, your bot still breaks when warriorhq isn't reachable. :-/
23:03 πŸ”— jc86035 I guess I could just keep using Toolforge to make lots of requests to the new Save Page Now, if I don't figure anything else out. Right now it's unnecessarily ad hoc, some of it still goes through via.hypothes.is because of an issue I had with the IA servers
23:04 πŸ”— jc86035 I haven't done any serious maintenance for several months though
23:04 πŸ”— icedice has joined #archiveteam-bs
23:04 πŸ”— jc86035 * to the old Save Page Now
23:06 πŸ”— jc86035 has quit IRC (Quit: Leaving.)
23:10 πŸ”— BlueMax has joined #archiveteam-bs
23:22 πŸ”— markedL I'm interested in setting up a scheduled cron-like grab; it's just that we have fire-drill projects at the moment
23:29 πŸ”— tech234a has quit IRC (Quit: Connection closed for inactivity)
23:45 πŸ”— killsushi has joined #archiveteam-bs
