[01:06] *** Stiletto has joined #archiveteam-bs
[01:08] *** Stilett0 has quit IRC (Read error: Operation timed out)
[01:13] *** Stilett0 has joined #archiveteam-bs
[01:17] *** Stiletto has quit IRC (Read error: Operation timed out)
[01:33] *** ndiddy has quit IRC (Read error: Operation timed out)
[01:41] *** ndiddy has joined #archiveteam-bs
[01:54] bithippo: Might also look at https://hackage.haskell.org/package/github-backup
[01:54] :thumbs up:
[02:09] *** JAA_ has joined #archiveteam-bs
[02:16] *** JAA_ has quit IRC (leaving)
[02:19] *** JAA_ has joined #archiveteam-bs
[02:19] *** JAA sets mode: +o JAA_
[02:22] *** JAA has quit IRC (leaving)
[02:22] *** JAA_ is now known as JAA
[02:36] *** ndiddy has quit IRC (Ping timeout: 492 seconds)
[03:10] *** jacketcha has joined #archiveteam-bs
[03:31] https://twitter.com/textfiles/status/970448544271233024 lol, looks like saying "guys, don't archive this!" is a good way of finding new archivists for that particular content niche. We should keep that in mind. ;-)
[03:48] you wouldn't download a website
[04:12] *** qw3rty113 has joined #archiveteam-bs
[04:14] It sure is great how twitter auto-detects when JS is disabled and redirects you to a version of their site that hides all the post content. Super usable.
[04:18] *** qw3rty112 has quit IRC (Read error: Operation timed out)
[05:05] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:08] *** dashcloud has joined #archiveteam-bs
[06:10] *** odemg has quit IRC (Read error: Operation timed out)
[06:21] *** odemg has joined #archiveteam-bs
[06:35] *** Pixi` has quit IRC (Quit: Pixi`)
[06:35] *** Pixi has joined #archiveteam-bs
[06:59] *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
[07:06] *** odemg has quit IRC (Read error: Connection reset by peer)
[07:08] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:22] *** odemg has joined #archiveteam-bs
[07:23] *** h3x has quit IRC (Read error: Operation timed out)
[07:24] *** h3x has joined #archiveteam-bs
[07:37] *** schbirid has joined #archiveteam-bs
[07:47] *** odemg has quit IRC (Ping timeout: 252 seconds)
[08:00] *** odemg has joined #archiveteam-bs
[10:21] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[10:33] *** Mateon1 has quit IRC (Read error: Operation timed out)
[10:33] *** Mateon1 has joined #archiveteam-bs
[11:06] *** odemg has quit IRC (Read error: Operation timed out)
[11:14] *** wp494 has joined #archiveteam-bs
[11:20] *** odemg has joined #archiveteam-bs
[11:55] *** odemg has quit IRC (Read error: Connection reset by peer)
[12:15] *** odemg has joined #archiveteam-bs
[13:21] *** odemg has quit IRC (Read error: Connection reset by peer)
[13:36] *** odemg has joined #archiveteam-bs
[15:25] *** odemg has quit IRC (Read error: Connection reset by peer)
[15:52] *** odemg has joined #archiveteam-bs
[15:52] *** odemg has quit IRC (Connection closed)
[15:53] *** Dimtree has quit IRC (Peace)
[15:56] *** odemg has joined #archiveteam-bs
[15:56] *** odemg has quit IRC (Connection closed)
[15:56] *** Dimtree has joined #archiveteam-bs
[15:59] *** odemg has joined #archiveteam-bs
[16:07] *** odemg has quit IRC (Read error: Connection reset by peer)
[16:18] *** odemg has joined #archiveteam-bs
[17:30] *** ld1 has quit IRC (Read error: Connection reset by peer)
[17:31] *** ld1 has joined #archiveteam-bs
[17:46] *** Dimtree has quit IRC (Peace)
[17:51] *** Dimtree has joined #archiveteam-bs
[17:59] *** Pixi has quit IRC (Quit: Pixi)
[18:03] *** jschwart has joined #archiveteam-bs
[18:09] JAA: Running
[18:09] Thanks
[18:10] 599c8fa2311eff5cfc358407fd262642e3db4034af6e10241e5af28a392e536f charlierose.com-videos-00000.warc.gz
[18:11] Sweet, thank you. :-)
[18:14] Do I delete this now
[18:17] No
[18:17] But it means that I can delete WARC 00000 on my machine and continue grabbing.
[18:18] And upload WARCs 1 through 43 and then delete those as well, etc.
[18:18] I'll let you know when it's done and ready for transfer to IA.
[18:18] Will probably take a week or so in total.
[18:19] *** Pixi has joined #archiveteam-bs
[18:20] *** SynMonger has joined #archiveteam-bs
[18:29] *** JAA sets mode: +o SketchCow
[18:45] *** K4k has joined #archiveteam-bs
[18:58] *** powerKitt has quit IRC (Quit: powerKitt)
[19:11] *** powerKitt has joined #archiveteam-bs
[19:24] *** fsr has joined #archiveteam-bs
[19:27] *** odemg has quit IRC (Read error: Connection reset by peer)
[19:27] *** h3x has quit IRC (Read error: Connection reset by peer)
[19:27] Is there currently any work done on archiving the Wii Shop Channel? There is an article in the wiki, anything else?
[19:28] Yeah, guy called Larsenv has been working to archive stuff
[19:29] He hasn't been able to get the secret word though, so he doesn't have an AT wiki account.
[19:29] Ah, yes, I read his post on RiiConnect. I was going to ask if archiveteam and the RiiConnect people are working together. Seems like a yes then. :)
[19:29] If anyone knows what the current one is, PM me it and I'll pass it on to him.
[19:29] the "secret word"?
[19:30] yeah, you need a passphrase from the IRC to sign up on archiveteam.org
[19:30] sign up for the wiki?
[19:30] yeah
[19:31] I see, thanks. What would be the best way to get in touch with larsenv? I'd be interested to know what he plans to do and what he has already done.
[19:32] do you have a discord account?
[19:32] nope
[19:34] I have one now.
[19:40] *** fsr has quit IRC ()
[19:47] *** odemg has joined #archiveteam-bs
[19:55] How do I get wpull to rewrite http://tinypic.com/images/404.gif to a 404 response?
[19:56] currently have a workaround in my later-running 404 detection/download-rescued-from-wayback
[20:07] *** odemg has quit IRC (Ping timeout: 255 seconds)
[20:10] @riking: You're probably better off using grab-site to perform that transform while archiving
[20:10] https://github.com/ludios/grab-site
[20:11] --custom-hooks=PY_SCRIPT: Copy PY_SCRIPT to DIR/custom_hooks.py, then exec DIR/custom_hooks.py on startup and every time it changes. The script gets a wpull_hook global that can be used to change crawl behavior. See update_custom_hooks in libgrabsite/wpull_hooks.py and custom_hooks_sample.py.
[20:13] ... okay and now I'm dealing with a wpull crash.
[20:13] https://paste.ubuntu.com/p/wVdhRH7CMf/
[20:14] it really doesn't like that blob:. The browser doesn't either but
[20:23] *** bsmith094 has quit IRC (Ping timeout: 252 seconds)
[20:23] *** odemg has joined #archiveteam-bs
[21:07] *** RichardG has quit IRC (Read error: Connection reset by peer)
[21:09] *** RichardG has joined #archiveteam-bs
[21:31] *** bsmith093 has joined #archiveteam-bs
[21:37] *** ranav has joined #archiveteam-bs
[21:42] bithippo: hm, well i'm already doing a ton of post-processing so I guess i'll just stick with what I have.
[21:42] *** ranavalon has quit IRC (Ping timeout: 633 seconds)
[21:43] *** BlueMax has joined #archiveteam-bs
[21:43] Sorry I couldn't be more help.
[21:43] I wonder if I should be segregating the non-pristine fetches into a separate WARC so that the original can be used? (e.g. I download an image from Wayback, insert into .warc with stomped WARC-Target-URI: header -> put that in a separate warc file)
[21:44] That's a question for someone more familiar with how the Wayback Machine and CDX server work.
[21:44] right now I'm uploading the WARC files to archive items, not to Wayback proper.
[21:44] but if someone tries to throw my warc files into wayback, those records are almost certainly going to mess it up
[21:45] eh sure might as well.
[21:49] up to 2700 lines, 67Kbytes of script. heh.
[21:51] When in doubt, WARCs should not have their content or headers adulterated.
[21:53] Here's a screenshot of what it's doing right now https://usercontent.irccloud-cdn.com/file/ZPQcNv29/image.png
[21:56] followed by the image as received from that request
[21:58] *** BlueMax has quit IRC (Leaving)
[22:00] *** jschwart has quit IRC (Quit: Konversation terminated!)
[22:08] @riking: What is the eventual target? Wayback? Or something else?
[22:08] Target is this stuff https://ia801505.us.archive.org/12/items/MSPFA_204/view.html?s=204&p=1
[22:09] if you open devtools you can see it using Range: requests to pick retrieved files out of the warc
[22:11] Archiving individual MS Paint Art stories as individual items?
[22:11] so you can see why having it all in one file is nice, but.
[22:11] Yeah, with all necessary subresources included.
[22:12] So, naively, it seems like the archiving operation isn't the issue, but that you need a more robust player to interpret the archived content for "playback"
[22:13] That sounds right
[22:13] I apologize if I'm being obtuse, I assure you I'm attempting to be helpful :D
[22:13] I can update the script to have varying warc filenames. Just wanted to make sure it was worth doing so first
[22:15] Want the whole site ripped?
[22:15] A previous version saved all the images as individual files. Then I noticed both the text on the help page and the reason for, "please do not have more than 1000 files per item"
[22:15] the devs are "currently" working on a version that doesn't use a js app for page advances
[22:15] but yes that's the idea
[22:16] Right now I'm slowly marching up through the story IDs, finding things that trip up my script - either the archiver or the player
[22:17] Total stories you need to walk?
[22:18] most recent created ID is 24778
[22:19] minus deleted entries
[22:21] *** BlueMax has joined #archiveteam-bs
[22:32] *** alex___ has joined #archiveteam-bs
[22:34] @riking: You might consider setting up an ArchiveTeam project for this
[22:34] oh right
[22:36] *** schbirid has quit IRC (Quit: Leaving)
[22:43] riking: are you doing mspfa?
[22:43] Yeah
[22:43] Nice.
[22:44] when the story uses photobucket, the archived items are now the best place to read it
[22:45] Yeah, Photo
[22:45] *Photobucket is fucking terrible
[23:10] *** Mayonaise has quit IRC (Read error: Connection reset by peer)
[23:32] @powerkitt: https://github.com/bibanon/PB_Spade for those PB shenanigans
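Editor's note on the [22:09] remark about using Range: requests to pick files out of the WARC: the sketch below illustrates the general idea under stated assumptions and is not riking's actual script or player. It assumes the archive is written as one gzip member per record (the usual convention for .warc.gz files) and that a byte offset and length are known for each record. The helper names `build_archive`, `range_header`, and `read_record` are hypothetical, the payloads are made-up stand-ins rather than real WARC records, and the "ranged read" is simulated with an in-memory slice instead of an HTTP request.

```python
import gzip
import io


def build_archive(records):
    """Write one gzip member per record; return (blob, index).

    index maps a record name to (offset, length) within the blob.
    """
    blob = io.BytesIO()
    index = {}
    for name, payload in records.items():
        start = blob.tell()
        blob.write(gzip.compress(payload))
        index[name] = (start, blob.tell() - start)
    return blob.getvalue(), index


def range_header(offset, length):
    """Build the HTTP Range header value covering exactly one member."""
    return "bytes={}-{}".format(offset, offset + length - 1)


def read_record(blob, index, name):
    """Simulate a ranged read: slice out one member and decompress only it."""
    offset, length = index[name]
    return gzip.decompress(blob[offset:offset + length])


# Made-up stand-in payloads, not real WARC records.
records = {
    "page.html": b"<html>story page</html>",
    "panel.gif": b"GIF89a fake image bytes",
}
blob, index = build_archive(records)
panel = read_record(blob, index, "panel.gif")
```

Because each member decompresses on its own, a player only has to fetch the bytes for the record it wants, which is why keeping everything in one file stays practical even for large items.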