[00:41] *** kiska has quit IRC (Remote host closed the connection)
[00:41] *** Flashfire has quit IRC (Remote host closed the connection)
[00:42] *** kiska has joined #archiveteam-ot
[00:42] *** Flashfire has joined #archiveteam-ot
[00:45] *** superkuh_ is now known as superkuh
[01:07] *** godane has quit IRC (Quit: Leaving.)
[01:07] *** godane has joined #archiveteam-ot
[01:11] I've been given a login for VampireFreaks (shutting down 1st Feb). I'm going to try to scrape what I can, but if ArchiveTeam are planning a scrape I've got permission to pass on the credentials.
[01:13] *** Joseph_ has joined #archiveteam-ot
[01:19] *** VerifiedJ has quit IRC (Read error: Operation timed out)
[01:27] *** nepeat has joined #archiveteam-ot
[01:47] josey, I did start an archivebot scrape, but the Javascript is interfering with parts. What level of contact do you have with the site?
[01:49] The web scrape method is the most basic way to preserve. Fancier involves figuring out APIs. Fanciest involves actual direct file or database tables. Usually the information saved is the information that can be fully public.
[01:49] Other information obviously has complications
[01:54] Thanks for starting an archivebot scrape. I don't have any contact or connection to the site. It might be worth contacting the site owner about a database, but I don't know what he'd say.
[01:54] Which bits are Javascript interfering with?
[01:55] *** godane has quit IRC (Ping timeout: 610 seconds)
[01:56] main thing I know about right now is the photo albums use JS to show the full image. ArchiveBot can only find the thumbnails.
[01:56] also, it doesn't look like older journal entries are visible. Maybe that is different after being logged in? IDK
[02:00] Do we have any idea when the scrape will complete? I could go through the scrape and get the full size images, and grep for "You do not have permission to view this content" so I knew what to get with the account.
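A rough sketch of the "grep for the permission message" idea, assuming the scraped pages have already been unpacked into plain files on disk (ArchiveBot output is WARC, so an extraction step would come first). The function name `pages_needing_login` and the directory layout are hypothetical; only the marker string comes from the message above.

```python
from pathlib import Path

# Marker text quoted in the chat; everything else here is an assumption.
MARKER = "You do not have permission to view this content"


def pages_needing_login(root, marker=MARKER):
    """Return the sorted paths of files under `root` whose text contains `marker`."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Decode leniently: scraped pages may not be valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        if marker in text:
            hits.append(path)
    return sorted(hits)
```

The resulting list could then be mapped back to URLs and re-fetched with the logged-in account.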
[02:00] Would the login info be useful for you?
[02:07] re. older journal entries, the oldest I've seen so far is from 2015.
[02:07] I'm just off to bed. I'll check back in tomorrow.
[02:16] *** godane has joined #archiveteam-ot
[02:19] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[03:08] *** fdstw has joined #archiveteam-ot
[03:09] *** fdstw has quit IRC (Quit: Leaving)
[03:10] *** cerca has quit IRC (Remote host closed the connection)
[03:11] *** X-Scale` has joined #archiveteam-ot
[03:11] *** fdstw has joined #archiveteam-ot
[03:14] *** godane has quit IRC (Read error: Connection reset by peer)
[03:17] *** X-Scale has quit IRC (Ping timeout: 610 seconds)
[03:17] *** X-Scale` is now known as X-Scale
[03:19] *** qnisz has joined #archiveteam-ot
[03:25] *** fdstw has quit IRC (Read error: Operation timed out)
[03:29] *** X-Scale` has joined #archiveteam-ot
[03:34] *** X-Scale has quit IRC (Read error: Operation timed out)
[03:34] *** X-Scale` is now known as X-Scale
[03:53] josey, IDK what the scrape timeline looks like. You can check progress in the http://dashboard.at.ninjawedding.org/?showNicks=1 and look for vampirefreaks or job id dqnnylxugllw9uyhtou49q4yc
[03:54] 87k in queue 422k done
[04:13] *** qw3rty__ has joined #archiveteam-ot
[04:17] *** qw3rty_ has quit IRC (Ping timeout: 276 seconds)
[04:17] *** m007a83 has joined #archiveteam-ot
[04:17] *** m007a83 has quit IRC (Read error: Connection reset by peer)
[04:18] *** m007a83 has joined #archiveteam-ot
[04:19] *** qnisz has quit IRC (Quit: Leaving)
[05:00] *** godane has joined #archiveteam-ot
[05:36] If Seagate is willing to just give a goofy Youtuber about $50k in free harddrives, I wonder what AT/IA can get by asking nicely. https://www.youtube.com/watch?v=eCz-IixxR_k
[05:38] Internet Archive Sponsored by Seagate
[05:39] We needs a marketing team that gives good mouth spin.
[06:32] JAA: there are many non indexed items in archiveteam_youtube
[06:35] iirc there was a mass noindex'ing because just please stop using up space to mirror youtube
[07:38] *** oxguy3 has quit IRC (My MacBook has gone to sleep. ZZZzzz…)
[07:38] *** oxguy3 has joined #archiveteam-ot
[07:38] *** oxguy3 has quit IRC (Client Quit)
[07:39] *** oxguy3 has joined #archiveteam-ot
[07:39] *** oxguy3 has quit IRC (Client Quit)
[07:41] *** oxguy3 has joined #archiveteam-ot
[07:41] *** oxguy3 has quit IRC (Client Quit)
[07:42] *** oxguy3 has joined #archiveteam-ot
[07:42] *** oxguy3 has quit IRC (Client Quit)
[07:42] *** oxguy3 has joined #archiveteam-ot
[07:42] *** oxguy3 has quit IRC (Client Quit)
[07:43] *** oxguy3 has joined #archiveteam-ot
[07:43] *** oxguy3 has quit IRC (Client Quit)
[07:44] *** oxguy3 has joined #archiveteam-ot
[07:44] *** oxguy3 has quit IRC (Client Quit)
[07:45] *** oxguy3 has joined #archiveteam-ot
[07:45] *** oxguy3 has quit IRC (Client Quit)
[07:46] *** oxguy3 has joined #archiveteam-ot
[07:46] *** oxguy3 has quit IRC (Client Quit)
[07:46] *** oxguy3 has joined #archiveteam-ot
[07:46] *** oxguy3 has quit IRC (Client Quit)
[07:47] *** oxguy3 has joined #archiveteam-ot
[07:47] *** oxguy3 has quit IRC (Client Quit)
[07:52] *** dhyan_nat has joined #archiveteam-ot
[08:33] *** dhyan_nat has quit IRC (Read error: Connection reset by peer)
[08:33] *** dhyan_nat has joined #archiveteam-ot
[09:15] Raccoon`, can we get WD drives instead? :)
[09:16] "Hey WD, Seagate offered us 100 10TB drives, would you guys do 200?"
[09:18] exactly
[09:19] in fairness, that video channel was started in 2008 and got the drives in 2017.
I guess ArchiveTeam is about the same age and is due for drives :)
[09:34] DigDug told is about it in #archivebot
[09:34] told us*
[09:34] wrong channel, derp
[09:35] *** DigiDigi has quit IRC (Remote host closed the connection)
[09:51] I am sure that hard drive companies are in the business of giving away free drives to weirdos and charities
[10:04] If it means a big sponsored ad at the bottom of every page for, say, 5 years
[10:06] And the rights to assert "Official Sponsor of the Internet Archive -- ``Preserving our history for them.``" or some touching slogan that doesn't exist yet.
[10:31] *** josey9 has joined #archiveteam-ot
[10:31] As I remember (though going through in a few minutes now, I can't find any examples), there's sort of a shift of attitude in early discussion of the IA from considering overt commercial "partnerships" like this as being potentially useful to opposing them on semi-philosophical grounds of independence
[10:36] By the way, here's Brewster Kahle without glasses: https://web.archive.org/web/20030305092730im_/http://chronicle.com/photos/v44/i26/4426a271.jpg
[10:36] *** josey has quit IRC (Ping timeout: 745 seconds)
[10:42] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[10:51] that independence thing is probably the key. IA probably doesn't want any heavy dependency (life or death level) on a commercial partnership
[11:30] *** josey has joined #archiveteam-ot
[11:37] *** josey9 has quit IRC (Ping timeout: 745 seconds)
[12:00] astrid: Yup, and I'm pretty sure ola_norsk was told about that at the time because he was one of the "lol tubeup!" users.
[14:35] Fionera: Hello nerd
[14:35] hi :D
[14:51] Hewo!
[15:01] *** Mateon1 has quit IRC (Remote host closed the connection)
[15:01] *** Mateon1 has joined #archiveteam-ot
[15:10] smh i am already ages here
[15:52] I can't see the job for vampirefreaks (id dqnnylxugllw9uyhtou49q4yc) on the ArchiveBot tracker anymore. Does that mean it's done?
Is it possible to get a list of URLs so I can crawl the links that require a login?
[15:59] atphoenix: ^
[16:18] Are the nonindexed YouTube still stored by IA?
[16:31] *** systwi has quit IRC (Ping timeout: 622 seconds)
[16:38] AT can use drives if IA does not want them
[16:40] *** systwi has joined #archiveteam-ot
[16:54] *** Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
[16:54] *** Craigle has joined #archiveteam-ot
[16:57] josey, check https://archive.fart.website/archivebot/viewer/
[16:58] https://archive.fart.website/archivebot/viewer/job/dqnny
[16:59] 5.4 GiB
[17:00] yes the vampirefreaks crawl finished. I know that the photos are thumbnails.
[17:01] a custom crawl of the photo URLs could fix those gaps
[17:02] *** DigiDigi has joined #archiveteam-ot
[17:25] *** DigiDigi has quit IRC (Remote host closed the connection)
[17:31] *** DigiDigi has joined #archiveteam-ot
[18:06] *** jamiew has joined #archiveteam-ot
[18:12] *** jamiew has quit IRC (zzz)
[18:13] *** jamiew has joined #archiveteam-ot
[18:21] *** NickN00b has joined #archiveteam-ot
[18:23] SketchCow: ah damn it, I thought I saw you at magfest, but I wasn't sure if it was you.
I should've stopped and said hi
[19:09] *** jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
[21:01] *** schbirid has joined #archiveteam-ot
[21:33] *** apache2_ has quit IRC (Remote host closed the connection)
[21:33] *** jodizzle has quit IRC (Read error: Operation timed out)
[21:33] *** jodizzle has joined #archiveteam-ot
[21:35] *** Jens has quit IRC (Remote host closed the connection)
[21:35] *** apache2 has joined #archiveteam-ot
[21:36] *** Jens has joined #archiveteam-ot
[21:45] *** apache2 has quit IRC (Remote host closed the connection)
[21:47] *** apache2 has joined #archiveteam-ot
[21:47] *** Jens has quit IRC (Remote host closed the connection)
[21:48] *** Jens has joined #archiveteam-ot
[21:55] *** ivan has quit IRC (Quit: Leaving)
[21:56] *** jodizzle has quit IRC (Quit: ZNC 1.7.1 - https://znc.in)
[21:56] *** jodizzle has joined #archiveteam-ot
[22:11] *** ivan has joined #archiveteam-ot
[22:12] *** svchfoo3 sets mode: +o ivan
[22:12] *** svchfoo1 sets mode: +o ivan
[22:25] *** schbirid has quit IRC (Quit: Leaving)
[22:35] *** Joseph__ has joined #archiveteam-ot
[22:35] *** Joseph_ has quit IRC (Read error: Connection reset by peer)
[22:39] *** Mateon1 has quit IRC (Ping timeout: 258 seconds)
[22:39] It was me.
[22:41] So when I was running the manual python yahoo groups download I saw the nice retry with backoff timer if there's a http error stuff, is that all handled by a specific part of the code that I could use to run a manual download of my own? I'm looking through the code on github to see if I can find it
[22:42] Thanks atphoenix for the links.
[22:44] e.g. I have a list of ~400k image urls and want retries and proper logging of any failures (my 5min bash script just checks filesize and retries if it's 0)
[22:48] *** BlueMax has joined #archiveteam-ot
[22:54] or is there a recommended script/tool that I could try?
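A minimal sketch of the retry-with-backoff idea asked about above, using only the Python standard library. This is not the code from the Yahoo Groups downloader. The names `backoff_delays`, `fetch_with_backoff` and `download_all`, the retry count, the delay values and the output file naming are all assumptions for illustration. It retries on every error, including 404s, which a real run would probably want to treat as permanent.

```python
import logging
import time
import urllib.error
import urllib.request
from pathlib import Path

log = logging.getLogger("fetch")


def backoff_delays(retries, base=1.0, cap=60.0):
    """Exponential delays in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def fetch_with_backoff(url, dest, retries=5, base=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Download `url` to `dest`. Return True on success, False once retries run out.

    `opener` and `sleep` are injectable so the logic can be exercised offline.
    """
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            with opener(url, timeout=30) as resp:
                data = resp.read()
            if not data:
                # Same check as the bash script: a 0-byte file counts as a failure.
                raise ValueError("empty response body")
            Path(dest).write_bytes(data)
            return True
        except (urllib.error.URLError, OSError, ValueError) as exc:
            if attempt == retries:
                log.error("giving up on %s after %d attempts: %s",
                          url, attempt + 1, exc)
                return False
            log.warning("attempt %d for %s failed (%s); sleeping %.1fs",
                        attempt + 1, url, exc, delays[attempt])
            sleep(delays[attempt])
    return False


def download_all(urls, out_dir, **kwargs):
    """Fetch every URL into `out_dir`; return the list of URLs that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for i, url in enumerate(urls):
        # Hypothetical naming scheme: index prefix plus the last path segment.
        name = f"{i:06d}_" + url.rstrip("/").rsplit("/", 1)[-1]
        if not fetch_with_backoff(url, out / name, **kwargs):
            failed.append(url)
    return failed
```

Calling `logging.basicConfig(filename="fetch.log", level=logging.INFO)` before `download_all(...)` would put the warnings and final failures in a file, and the returned list can be fed back in for another pass.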
[22:58] *** Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
[23:51] *** dhyan_nat has quit IRC (Read error: Operation timed out)