[01:43] *** pizzaiolo has quit IRC (Remote host closed the connection)
[01:47] *** dashcloud has quit IRC (Ping timeout: 492 seconds)
[01:51] *** dashcloud has joined #archiveteam-bs
[02:01] *** RichardG has quit IRC (Read error: Connection reset by peer)
[02:01] *** RichardG has joined #archiveteam-bs
[03:56] Do we have a technique to archive an entire subreddit?
[04:01] *** Stilett0 has joined #archiveteam-bs
[04:18] *** BlueMax has joined #archiveteam-bs
[04:20] *** qw3rty115 has joined #archiveteam-bs
[04:23] *** qw3rty114 has quit IRC (Read error: Operation timed out)
[05:24] SketchCow: i'm having trouble connecting to FOS
[05:30] btw i found out i got a complete copy of Body Mind And Soul The Mystery And The Magic on another tape
[05:30] it was missing over 1 hour on the first tape with it
[05:38] nevermind about FOS having problems
[05:38] it's fine now
[05:44] SketchCow: btw since i mailed those tapes you can send me more tapes
[05:45] i also need more shipping labels to mail the rest
[07:39] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[07:39] *** dashcloud has joined #archiveteam-bs
[07:41] *** schbirid has joined #archiveteam-bs
[08:57] phuzion: Not really, no. Currently, it's still possible in theory to do that with the search API, but that will be gone soon as well.
[08:57] Otherwise, you only get the 1000 newest/top/... threads.
[08:58] Large threads also cause various issues since you have to follow the "load more comments" links etc.
[08:58] (Which are handled with JS, no less.)
[09:05] *** BlueMaxim has joined #archiveteam-bs
[09:11] *** BlueMax has quit IRC (Read error: Operation timed out)
[09:17] *** RichardG has quit IRC (se.hub irc.efnet.nl)
[09:17] *** will has quit IRC (se.hub irc.efnet.nl)
[09:17] *** Smiley has quit IRC (se.hub irc.efnet.nl)
[09:17] *** BnAboyZ has quit IRC (se.hub irc.efnet.nl)
[09:17] *** kisspunch has quit IRC (se.hub irc.efnet.nl)
[09:17] *** Zebranky has quit IRC (se.hub irc.efnet.nl)
[09:17] *** MrRadar2 has quit IRC (se.hub irc.efnet.nl)
[09:17] *** BnARobin has quit IRC (se.hub irc.efnet.nl)
[09:17] *** jtn2 has quit IRC (se.hub irc.efnet.nl)
[09:17] *** Tenebrae has quit IRC (se.hub irc.efnet.nl)
[09:17] *** Fusl has quit IRC (se.hub irc.efnet.nl)
[09:17] *** hook54321 has quit IRC (se.hub irc.efnet.nl)
[09:17] *** ez has quit IRC (se.hub irc.efnet.nl)
[09:17] *** Polylith has quit IRC (se.hub irc.efnet.nl)
[09:25] JAA: I could easily add support for clicking these links to chromebot.
[09:27] Yeah, that part is solvable. I'm more worried about the search API change and not being able to find all threads in a subreddit anymore.
[09:29] However replay won’t work, since it uses POST requests.
[09:37] *** BlueMaxim has quit IRC (Leaving)
[10:10] The Pushshift API could be a workaround for Reddit crippling the search API. https://redditsearch.io/
[10:11] Also, TIL that the Pushshift archives do not contain deleted comments.
[10:12] They have a realtime component for the API which should have everything that was available on Reddit for more than a second or so.
[10:12] But the archives are independent monthly crawls. Comments deleted between the posting time and the time of the monthly crawl are lost.
[10:12] Is anyone aware of other Reddit archives?
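Editor's sketch: one way around the 1000-thread listing cap mentioned above is to page backwards by timestamp through a search API such as Pushshift's. The endpoint path and the `subreddit`, `before`, `size` and `sort` parameter names below are assumptions based on how Pushshift was commonly used and should be checked against its current documentation; `fetch_page` is a hypothetical caller-supplied function, so nothing here touches the network.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify before relying on them.
PUSHSHIFT_SUBMISSIONS = "https://api.pushshift.io/reddit/search/submission/"


def page_url(subreddit, before=None, size=100):
    """Build the URL for one page of submissions older than `before` (epoch seconds)."""
    params = {"subreddit": subreddit, "size": size, "sort": "desc"}
    if before is not None:
        params["before"] = before
    return PUSHSHIFT_SUBMISSIONS + "?" + urlencode(params)


def iter_submissions(subreddit, fetch_page, size=100):
    """Yield submissions newest-first, paging backwards by timestamp.

    `fetch_page(url)` is supplied by the caller and must return a list of
    dicts, each with a `created_utc` field, ordered newest first.
    """
    before = None
    while True:
        page = fetch_page(page_url(subreddit, before, size))
        if not page:
            return
        yield from page
        # Continue from the oldest timestamp seen on this page.
        before = min(item["created_utc"] for item in page)
```

If `before` is a strict bound, submissions that share the exact boundary timestamp could be skipped; a real crawler would probably overlap pages by a second and deduplicate by ID.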
[11:20] *** pizzaiolo has joined #archiveteam-bs
[11:22] *** mabynogy has joined #archiveteam-bs
[11:37] *** RichardG has joined #archiveteam-bs
[11:37] *** will has joined #archiveteam-bs
[11:37] *** Smiley has joined #archiveteam-bs
[11:37] *** BnAboyZ has joined #archiveteam-bs
[11:37] *** kisspunch has joined #archiveteam-bs
[11:37] *** Zebranky has joined #archiveteam-bs
[11:37] *** MrRadar2 has joined #archiveteam-bs
[11:37] *** BnARobin has joined #archiveteam-bs
[11:37] *** jtn2 has joined #archiveteam-bs
[11:37] *** Tenebrae has joined #archiveteam-bs
[11:37] *** Fusl has joined #archiveteam-bs
[11:37] *** hook54321 has joined #archiveteam-bs
[11:37] *** ez has joined #archiveteam-bs
[11:37] *** Polylith has joined #archiveteam-bs
[12:45] *** ranavalon has joined #archiveteam-bs
[12:47] *** bitspill has quit IRC ()
[12:48] *** bitspill has joined #archiveteam-bs
[13:10] *** midas has quit IRC ()
[13:11] *** midas has joined #archiveteam-bs
[13:40] *** DrasticAc has quit IRC ()
[13:40] *** DrasticAc has joined #archiveteam-bs
[14:12] so i'm over 12k items this month
[14:12] 49,128 items so far this year
[14:31] *** Mateon1 has quit IRC (Ping timeout: 252 seconds)
[14:31] *** Mateon1 has joined #archiveteam-bs
[14:56] *** riking has quit IRC ()
[14:56] *** riking has joined #archiveteam-bs
[14:57] *** ThisAsYou has quit IRC ()
[14:57] *** ThisAsYou has joined #archiveteam-bs
[15:08] *** dogsrcool has joined #archiveteam-bs
[15:15] *** VerifiedJ has joined #archiveteam-bs
[15:17] JAA that's why he created it
[15:18] Has a project for oddshot.tv been proposed?
[15:18] https://oddshot.tv/
[15:18] https://medium.com/the-oddshot-loop/end-of-an-era-aefeca0420bf
[15:19] "Oddshot.tv will shutdown it's servers and applications on Monday February 12th Video files will not be accessible after this time."
[15:19] I've started archiving 4chan /g board - I have a json file per day - I plan to add an image scraper soon to collect the memes - I'd like to put that somewhere where anybody could download it - any idea about that?
[15:20] fuck. the scanner I'm using for these old SF mags is generating TIFFs as required.... but the compression scheme inside the TIFF is "JPEG"
[15:22] i don't think the scanner gives me the option of controlling the compression type either
[15:22] API documentation for oddshot is here: https://api.oddshot.tv/docs/
[15:23] This seems like a perfect AT project
[15:29] mundus: Hm?
[15:30] the pushshift api
[15:31] Yes
[15:31] Ah, you were talking about the search part?
[15:31] I thought this was in reference to the deleted comments.
[15:32] yeah, the search
[15:32] The deleted comments exist before he started live archiving
[15:33] I think he started live archiving like 2 years ago
[15:33] https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
[15:39] mundus: https://www.reddit.com/r/datasets/comments/71lkf1/pushshift_dataset_has_at_least_11_days_delay_in/dnbrl8t/?context=10000
[15:39] hmm
[16:00] *** atrocity has joined #archiveteam-bs
[16:29] godane: I will tell my folks!
[16:29] *** ZexaronS has joined #archiveteam-bs
[16:31] Oddshot has about 15 million video views per month: https://www.reddit.com/r/GlobalOffensive/comments/7wd4qh/psa_oddshot_is_shutting_down/dtzeyav/?context=1
[16:31] Maybe that helps for estimating the size.
[16:31] SketchCow: i will also need labels to mail my own vhs tapes that i have digitized
[16:55] so the tape i'm capturing gave me some trouble
[16:56] like almost static in middle of picture
[16:56] i start capturing in part of the tape i had no picture and got picture
[16:57] so i'm going to see about recapturing it
[16:58] nevermind going to skip this part of tape
[17:24] anyways i skipped that tape cause it kept giving me no video after awhil
[17:24] *awhile
[17:30] so i got HBO First Look at Nine Months
[17:30] and On The Set Judge Dredd videos
[17:47] i'm at 124k for DTIC Archive
[17:59] *** icedice has joined #archiveteam-bs
[18:00] *** ZexaronS has quit IRC (Quit: Leaving)
[18:07] i'm running a recheck of dtic archive pdfs to see that i uploaded everything in older areas
[18:07] already found one number that doesn't have an item page
[19:06] *** pizzaiolo has quit IRC (Remote host closed the connection)
[19:09] *** pizzaiolo has joined #archiveteam-bs
[19:14] *** w0rp has quit IRC (Read error: Operation timed out)
[19:16] *** w0rp has joined #archiveteam-bs
[19:22] *** REiN^ has quit IRC (Read error: Operation timed out)
[19:24] *** icedice2 has joined #archiveteam-bs
[19:27] *** icedice2 has quit IRC (Client Quit)
[19:27] *** icedice has quit IRC (Read error: Operation timed out)
[19:28] *** Ravenloft has quit IRC (Read error: Operation timed out)
[20:15] my latest digitized tapes: https://www.patreon.com/posts/digitize-tapes-16904851
[20:19] *** schbirid has quit IRC (Quit: Leaving)
[20:20] *** ola_norsk has joined #archiveteam-bs
[20:21] hi. I'm wondering if someone could help me out in regards to using S3 at IA..eg there's a setting in the app i'm using "S3_ROOT=s3://bucket/path/" ..
[20:24] i don't want to mess anything up or cause clutter, so while making sense of https://archive.org/help/abouts3.txt , i'd prefer some input
[20:26] e.g what's a "S3 Bucket" ?
[20:26] s3 bucket => ia item
[20:27] ty. So ideally i could actually have one single item to WARC to ?
[20:29] e.g one "Ola_Norsk_WARCS" item etc
[20:29] *** jschwart has joined #archiveteam-bs
[20:32] *** RichardG has quit IRC (Read error: Connection reset by peer)
[20:34] the software is webrecorder btw https://github.com/webrecorder/webrecorder ..and I'm guessing "path" would be collection then
[20:34] *** RichardG has joined #archiveteam-bs
[20:34] e.g "Ola_Norsk_WARCS/twitter_hashtag_netneutrality"..
[20:34] *** schbirid has joined #archiveteam-bs
[20:36] or "Ola_Norsk_WARCS/some_collection_name/"
[20:42] or e.g, since i focus on twitter for the time being; "S3_ROOT=s3://twitter_hashtags_//"
[20:45] i guess it will be safer if i try it out on a test item first :]
[20:51] *** schbirid has quit IRC (Quit: Leaving)
[20:55] e.g "S3_ROOT=s3://s3.us.archive.org/ola_norsk_warcs_test" would be correct for a single item?
[21:35] *** mabynogy has quit IRC (Quit: dpt.slasheva.com)
[21:42] *** REiN^ has joined #archiveteam-bs
[21:44] *** ZexaronS has joined #archiveteam-bs
[21:45] *** mabynogy has joined #archiveteam-bs
[21:50] *** BlueMax has joined #archiveteam-bs
[22:12] *** mabynogy has quit IRC (Quit: dpt.slasheva.com)
[22:12] *** VerifiedJ has left
[22:33] *** pizzaiolo has quit IRC (pizzaiolo)
[22:33] *** pizzaiolo has joined #archiveteam-bs
[22:37] *** pizzaiolo has quit IRC (Client Quit)
[22:38] *** godane has quit IRC (Read error: Operation timed out)
[22:38] *** pizzaiolo has joined #archiveteam-bs
[22:48] arkiver, JAA SketchCow DFJustin do we know about this, https://medium.com/the-oddshot-loop/end-of-an-era-aefeca0420bf closing shop monday....
[22:49] "Over the last year, since introducing upload to the platform, we noticed more and more unsavory NSFW content being uploaded, which quickly became almost impossible to moderate. The frontpage in which we used to love everyday, seeing the best daily / weekly / monthly video highlights became a page we avoided and hated."
[22:49] ....always the fappers that ruin everything...
[22:49] *** godane has joined #archiveteam-bs
[22:49] *** godane has quit IRC (Client Quit)
[22:53] odemg: Yes, we know about it.
[22:56] JAA, are we already getting it or is it too late?
[22:58] I threw it into ArchiveBot earlier, which grabbed some of it, but I'm not aware of any systematic efforts.
[22:59] The COO is active on Reddit. Could be worth trying to contact them for more information (how large in total, possibility of a bulk export, etc.)
[23:01] UnfunMid is the username if you want to give that a shot (heh).
[23:04] *** jschwart has quit IRC (Quit: Konversation terminated!)
[23:07] fucking photobucket. is there any way to get the images out?
[23:07] http://pic.photobucket.com/bwe.png
[23:09] Need to send the right referrer etc.
[23:09] Yep, they suck.
[23:09] http://photobucket.com/gallery/user/beitstuck/media/cGF0aDovQm9va0ZvcnQucG5n/?ref=
[23:09] uhhh
[23:09] i'm seeing bwe, on photobucket.
[23:09] do you see an image there?
[23:10] HOLY SHIT THE ADS, i opened incognito
[23:10] lol, that's an interesting one.
[23:10] Well yeah, image hosting is not exactly a profitable business.
[23:10] could be data limit?
[23:11] bwe = bandwidth exceeded
[23:11] Is there a bandwidth limit now as well?
[23:11] I thought that was only for external requests.
[23:11] I hit the "random" button on the site i'm writing an archive viewer for
[23:12] bandwidth limit is serverside is it not?
[23:12] https://support.photobucket.com/hc/en-us/articles/200724504-Storage-Vs-Bandwidth-What-s-The-Difference-
[23:13] Sounds like they still only limit bandwidth that comes from third party sites.
[23:13] Which makes sense, since they plaster their own website with ads.
[23:13] So they already make money from those visitors.
[23:13] going to bet: that's the policy but it's not the implementation.
[23:14] actually i think it's an account wide flag?
[23:15] That's definitely possible. Photobucket being a piece of shit wouldn't exactly be breaking news...
[23:16] i clicked around and noticed load times, as if they weren't sure whether they wanted to say "fuck you" or not
[23:16] Or their servers just suck...
[23:16] * to clarify: as if nobody had asked for that image in years
[23:16] completely cold cache
[23:20] riking: On that link you mentioned, it might be a broken image gallery or so. I clicked the right arrow once and started seeing images and couldn't get back to the bwe.png afterwards.
[23:20] Right so if you're investigating, here's the image source https://mspfa.com/?s=540&p=1
[23:21] source has not been audited for quality :P
[23:21] Also, interestingly, I can download the first image in that gallery just fine with curl without a referrer or proper user agent.
[23:21] http://i996.photobucket.com/albums/af84/sharonitzhaki/icons/65s-1.jpg is the link for that.
[23:22] i'm seeing a different username
[23:22] Huh
[23:22] Oh yeah, I somehow got redirected to a different gallery, dafuq?
[23:25] I guess that was the right arrow click then.
[23:28] *** BlueMax has quit IRC (Leaving)
[23:32] I found a few projects on GitHub which try to work around this problem. Looks like the method used by https://github.com/nicinabox/fixpb still works, for example.
[23:33] curl -v -H 'Referer: http://photobucket.com/gallery/user/kk251/media/NORTON%20TANK%20on%20V11_zpsibge9erf.jpg' 'https://i282.photobucket.com/albums/kk251/BKL93908/NORTON%20TANK%20on%20V11_zpsibge9erf.jpg' >/dev/null
[23:33] Returns a ~70 kB JPEG
[23:34] *** REiN^ has quit IRC (Read error: Operation timed out)
[23:35] Hm, now I also get it without specifying the referrer. Might be in some server-side cache now or something.
[23:36] Huh, I guess the redirects to bwe.png might also get cached. lol
[23:37] That would also explain why your original link got the error, riking.
[23:38] Oh nice.
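Editor's sketch: the curl command quoted above sets a gallery page as the `Referer` when requesting the image. A rough Python equivalent using only the standard library might look like the following. The function names and the default user agent string are illustrative assumptions, and treating a final URL that ends in `/bwe.png` as a failure is a guess based on the redirect behaviour described in the chat, not documented Photobucket behaviour.

```python
import urllib.request


def build_image_request(image_url, gallery_url, user_agent="Mozilla/5.0"):
    """Build a GET request that presents the gallery page as the referrer."""
    req = urllib.request.Request(image_url)
    req.add_header("Referer", gallery_url)
    # Placeholder user agent; the chat suggests it is not always needed.
    req.add_header("User-Agent", user_agent)
    return req


def fetch_image(image_url, gallery_url, timeout=30):
    """Download the image bytes, failing if the placeholder image was served."""
    req = build_image_request(image_url, gallery_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # urlopen follows redirects, so geturl() is the final URL.
        if resp.geturl().endswith("/bwe.png"):
            raise RuntimeError("redirected to the bandwidth-exceeded placeholder")
        return resp.read()
```

Only `fetch_image` touches the network; `build_image_request` can be inspected offline to confirm which headers would be sent.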
[23:38] So once you trigger the redirect, you need to wait six hours (according to the Expires header) until you can access it again.
[23:38] Fuck Photobucket.
[23:39] hoooo wee
[23:43] *** icedice has joined #archiveteam-bs
[23:45] Regarding the size of Oddshot: https://www.reddit.com/r/DataHoarder/comments/7wdcb8/oddshottv_the_stream_clip_hosting_service_is/du0b7ri/
[23:45] No estimate yet, but the COO will likely post it there.
[23:45] Also, there's an API, but you can't just get access immediately: https://www.reddit.com/r/Oddshot/comments/4szqxe/oddshot_api/
[23:45] Well, maybe that information is outdated, but I can't find anything else about it.
[23:47] I could've sworn I saw a post about it on Reddit somewhere earlier today, but I can't find it anymore.
[23:51] Everything on the website happens through POST requests with GraphQL.
[23:55] there's a tool to get graphql docs
[23:55] Oh hey, there we go: api.oddshot.tv
[23:56] "Our API is public and you can get a key from your account profile on Oddshot, there is a 'show key' button."
[23:56] https://www.reddit.com/r/GlobalOffensive/comments/5pg0wh/female_cs_elegiggle/dcrfwru/
[23:56] graphdoc -e https://gql.twitch.tv/gql -x 'Client-Id: kimne78kx3ncx6brgo4mv6wki5h1ko' -x 'Authorization: OAuth 2v6zpwzz1ghjb3ujv0gzuve98ue7db' -o graphql-docs/ -f
[23:56] ... those weren't important
[23:58] and key revoked, don't bother trying to use that oauth.
[23:58] The API doesn't expose all data though, as far as I can tell.
[23:58] well yeah, that's why you use graphdoc with a Cookie: header
[23:58] or whatever the live site is actually using
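Editor's sketch: tools like graphdoc work by sending a GraphQL introspection query as a JSON POST, with whatever headers the live site expects. A minimal standard-library version of that request is shown below. The endpoint and cookie value in the usage note are placeholders rather than Oddshot's real ones, and a server may disable introspection entirely.

```python
import json
import urllib.request

# A small introspection query: lists the names and kinds of the schema's types.
INTROSPECTION_QUERY = "{ __schema { types { name kind } } }"


def build_graphql_request(endpoint, query, headers=None, variables=None):
    """Build a JSON POST request carrying a GraphQL query."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    req = urllib.request.Request(endpoint, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    # Extra headers (for example a Cookie copied from a browser session).
    for name, value in (headers or {}).items():
        req.add_header(name, value)
    return req


def run_query(req, timeout=30):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call might look like `run_query(build_graphql_request("https://example.invalid/graphql", INTROSPECTION_QUERY, headers={"Cookie": "session=placeholder"}))`, with the real endpoint and headers taken from the browser's network inspector.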