[00:03] *** ranav has quit IRC (Read error: Connection reset by peer)
[00:14] *** ranavalon has joined #archiveteam-bs
[00:14] *** ranavalon has quit IRC (Remote host closed the connection)
[00:15] *** ranavalon has joined #archiveteam-bs
[00:18] *** BlueMaxim has quit IRC (Leaving)
[01:00] *** ranavalon has quit IRC (Quit: Leaving)
[01:15] *** BlueMaxim has joined #archiveteam-bs
[01:42] *** yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
[01:42] *** yuitimoth has joined #archiveteam-bs
[01:54] *** yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
[01:54] *** yuitimoth has joined #archiveteam-bs
[02:12] *** DFJustin has quit IRC (Remote host closed the connection)
[02:15] *** DFJustin has joined #archiveteam-bs
[02:15] *** swebb sets mode: +o DFJustin
[02:38] Is it possible to change the collection an item belongs to after creating said item?
[02:43] *** bithippo has quit IRC (Ping timeout: 260 seconds)
[03:29] *** atlogbot has quit IRC (Read error: Operation timed out)
[03:29] *** swebb has quit IRC (Read error: Operation timed out)
[03:30] *** swebb has joined #archiveteam-bs
[03:30] *** atlogbot has joined #archiveteam-bs
[03:30] *** svchfoo3 sets mode: +o swebb
[03:30] *** svchost03 sets mode: +v atlogbot
[04:46] *** jdude104 has quit IRC (Read error: Operation timed out)
[04:49] *** qw3rty14 has joined #archiveteam-bs
[04:53] *** qw3rty13 has quit IRC (Read error: Operation timed out)
[05:05] *** K4k has quit IRC (Read error: Connection reset by peer)
[05:42] *** w0rp has quit IRC (Ping timeout: 245 seconds)
[05:45] *** w0rp has joined #archiveteam-bs
[06:28] *** zyphlar has joined #archiveteam-bs
[07:04] *** sekolyn has joined #archiveteam-bs
[07:05] *** octothorp has quit IRC (Read error: Operation timed out)
[07:28] *** octothorp has joined #archiveteam-bs
[07:29] *** sekolyn has quit IRC (Read error: Operation timed out)
[07:29] *** kpz has joined #archiveteam-bs
[07:30] *** kpz has left
[07:44] *** Asparagir has joined #archiveteam-bs
[08:38] *** zyphlar has quit IRC (Quit: Connection closed for inactivity)
[08:46] *** Asparagir has quit IRC (Asparagir)
[09:49] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[11:07] *** slyphic has quit IRC (Read error: Operation timed out)
[11:13] *** slyphic has joined #archiveteam-bs
[11:42] *** ZexaronS- has quit IRC (Quit: Leaving)
[12:31] *** altlabel has joined #archiveteam-bs
[12:35] Anyone else getting a lot of errors when accessing the Wayback Machine? I get "unable to connect", timeouts, pages which never finish loading, etc.
[12:46] *** sep332_ has joined #archiveteam-bs
[12:47] *** sep332 has quit IRC (Read error: Operation timed out)
[13:30] *** jacketcha has quit IRC (Read error: Connection reset by peer)
[13:48] Seems to be better now.
[15:20] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[15:20] *** Mateon1 has joined #archiveteam-bs
[15:33] *** jdude104 has joined #archiveteam-bs
[15:46] *** jdude104 has quit IRC (Quit: Leaving)
[16:37] *** schbirid has joined #archiveteam-bs
[17:33] *** RichardG_ has joined #archiveteam-bs
[17:33] *** RichardG has quit IRC (Read error: Connection reset by peer)
[17:47] *** RichardG_ has quit IRC (Read error: Connection reset by peer)
[17:50] *** jschwart has joined #archiveteam-bs
[17:53] *** RichardG has joined #archiveteam-bs
[17:54] *** RichardG has quit IRC (Read error: Connection reset by peer)
[17:57] *** RichardG has joined #archiveteam-bs
[18:13] I'm still cleaning WARCs
[18:13] https://archive.org/details/archiveteam_miiverse is getting that massive miiverse grab
[18:14] https://archive.org/details/warczone now exists. It is "outsider" WARCs, WARCs where we have no idea who is sending them. There's a good chance they won't go directly into Wayback.
[18:15] *** ReimuHaku has quit IRC (Ping timeout: 250 seconds)
[18:17] *** ReimuHaku has joined #archiveteam-bs
[18:17] *** RichardG has quit IRC (Read error: Connection reset by peer)
[18:17] *** RichardG has joined #archiveteam-bs
[18:22] *** K4k has joined #archiveteam-bs
[18:23] SketchCow: that's a damn fine pun you made there
[18:46] https://archive.org/details/archiveteam_yahoogroups is about to get super huge
[18:48] *** adinbied has joined #archiveteam-bs
[18:50] *** RichardG has quit IRC (Read error: Connection reset by peer)
[18:50] Hi all, I seem to have lost the link to the Discord server; can anyone send it to me? A while back I asked about archiving Gazelle-based sites and got linked to the Discord server to talk to -Archivist-, as he/she was working on that at the time. Thanks!
[18:51] *** RichardG has joined #archiveteam-bs
[18:53] adinbied: If it was posted in here, try searching the logs: http://archive.fart.website/bin/irclogger_logs
[18:55] Found it, thanks!
[18:59] *** adinbied has quit IRC (Quit: Page closed)
[19:35] *** ndiddy_ has quit IRC ()
[19:38] *** ndiddy has joined #archiveteam-bs
[20:01] *** REiN^ has quit IRC (no.money.no.love)
[20:02] *** purplebot has quit IRC (Ping timeout: 248 seconds)
[20:02] SketchCow: Can I get permission to upload to that collection?
[20:02] *** HCross2 has quit IRC (Ping timeout: 248 seconds)
[20:03] *** Rai-chan has quit IRC (Ping timeout: 248 seconds)
[20:03] *** i0npulse has quit IRC (Ping timeout: 248 seconds)
[20:09] *** RichardG has quit IRC (Ping timeout: 248 seconds)
[20:10] *** RichardG has joined #archiveteam-bs
[20:17] *** AeonG_ has joined #archiveteam-bs
[20:22] *** Caz has quit IRC (Read error: Operation timed out)
[20:30] *** purplebot has joined #archiveteam-bs
[20:30] *** i0npulse has joined #archiveteam-bs
[20:31] *** HCross2 has joined #archiveteam-bs
[20:31] *** svchfoo1 sets mode: +o HCross2
[20:33] *** Rai-chan has joined #archiveteam-bs
[20:33] *** odemg has quit IRC (Read error: Operation timed out)
[20:36] *** odemg has joined #archiveteam-bs
[20:41] I don't see why not; you're one of the processes I've got cleaning up
[20:43] You now have archiveteam_yahoogroups. You might need to log out of your browser to get it noticed.
[20:50] SketchCow: Thanks for moving those miiverse files
[20:52] Kinda realized partway through making them that I _probably_ should have fewer of them, rather than 10,000-post chunks.
[20:52] But, hey, it's easier for people to download a 200 MB WARC than multiple terabytes if they just need one post ;)
[20:55] *** octothorp has quit IRC (Read error: Connection reset by peer)
[20:59] *** Rai-chan has quit IRC (Ping timeout: 248 seconds)
[21:01] *** HCross2 has quit IRC (Ping timeout: 248 seconds)
[21:01] *** purplebot has quit IRC (Ping timeout: 248 seconds)
[21:02] SketchCow: my cat threw up on one of your boxes
[21:03] I NEED TO GET LABELS NOW SO I CAN MAIL THEM BEFORE THE CAT RUINS YOUR STUFF
[21:06] *** i0npulse has quit IRC (Ping timeout: 248 seconds)
[21:06] tapes are fine but the box has dry cat vomit on it
[21:08] *** SketchCow changes topic to: Lengthy Archive Team related discussions here | General archiving & offtopic: #archiveteam-ot | < godane> SketchCow: my cat throw up on one of your boxes
[21:08] Let me get on that
[21:08] DrasticAc: yes, if I'd had more of a say on your project, I'd have said you should have 50 GB per item
[21:09] Yeah, it was one of those things I didn't know until it was too late to switch.
[21:10] But next time, I have a better idea of what to do.
[21:10] godane: I am glad I am not the only one with that problem. My cats puke on stuff all the time ¬_¬
[21:11] I don't know if it'll be useful, but I was thinking of making a mini-archivebot for stuff like Slack or Discord.
[21:12] https://archive.org/details/archiveteam_verizon
[21:12] You can see my script slowly adding a filler logo to all the items
[21:12] Since it seems like a portion of the stuff that gets submitted to archivebot is one-off sites (like twitter links), having something like that available more widely might be useful.
[21:13] lgloo: lucky for me, most of my stuff is in my room
[21:13] Although I guess you can use the IA extension for that.
[21:13] The problem is that people are not very good at assessing archivebot
[21:13] and the cat doesn't come into my room
[21:13] but there is no room for boxes in my room
[21:13] And we get people doing things like "hurr durr The Onion is pretty amazeballs, I better kick off a million-url job with one line because just in case"
[21:14] "Hey, someone mirrored a mirror of a mirror we mirror, better get THAT copy too"
[21:14] We are trying to police that much better though....
[21:14] We are
[21:14] Adding it to random discords or slacks would not be smart
[21:14] Could keep a database to check against that though.
[21:14] I'd kill any link
[21:14] Like, if x link was already archived, don't do it again.
[21:14] Drop to a whitelist of people who can kick off jobs
[21:15] DrasticAc we do that. But it's just a bit broken at the moment. If you want to help us fix it we'd appreciate it ;-)
[21:15] AB is a victim of its own success.
[21:16] Just saying. Don't make more links to archivebot
[21:16] in other news i got my archivebox rpi project to broadcast a 'honeypot' wifi
[21:16] Or things that can kick off archivebot to an even larger set of feel-no-pain instigators
[21:17] Oh no, I'm not saying make a slack bot that talks to _our_ archivebot.
[21:17] next part of my project is to add a local wayback machine to it
[21:17] I'm saying "make something totally different that offers a limited set of its functions"
[21:17] Oh, here's a project I was thinking about that someone should do.
[21:17] Ready?
[21:17] You seem to all be quite capable of this.
[21:17] A little package that, if you drop it in a directory and the directory has WARCs, gives you a little mini wayback for it
[21:18] Which maybe has a navigatron option for the family of URLs it covers
[21:19] So, something that can run on any server? and provides a wayback feel for the WARCs in that directory?
[21:19] Yes.
[21:19] Or a subdirectory, I guess
[21:19] WARCS/
[21:19] Interesting, I like the idea of that
[21:20] Yeah, that sounds very useful
[21:20] Do it
[21:20] waiting
[21:20] * SketchCow taps watch
[21:21] Just wait till I get off of work, have dinner, etc.
[21:24] https://www.youtube.com/watch?v=af3mlZ28MzI
[21:24] << I love that film >>
[21:36] *** purplebot has joined #archiveteam-bs
[21:43] *** purplebot has quit IRC (hub.dk irc.underworld.no)
[22:12] *** k_o has joined #archiveteam-bs
[22:14] *** Jon has joined #archiveteam-bs
[22:16] hmm. I've got a Blu-ray, CC-BY-SA-NC, but it is DRM protected. I would like to put it on archive.org but am not sure whether to put it up with or without the DRM. Also, a prior upload by someone else years back got deleted without explanation
[22:17] do you have a link to this prior upload? it was probably darked because the copyright holder complained. i can check though.
[22:22] *** octothorp has joined #archiveteam-bs
[22:22] k_o: VSCO will be quite annoying to archive with all that JS going on. If you could write up a summary of what the site structure is like and how the content can be accessed, that would be great.
[22:23] Looks like they don't use numeric IDs though, so iterating over everything won't be easy.
[22:23] Oh, the site is one of the worst things I've ever seen.
[22:23] I've got two scripts that can download it, though.
[22:24] That's definitely also helpful, yes.
[22:24] The one I prefer is from GitHub and it's written in Ruby
[22:24] Lemme find the link
[22:24] (Ugh, Ruby. ;-) )
[22:24] https://github.com/HuggableSquare/vsco-dl Well, the other one I wrote in Python, but it's a good deal slower than this one and doesn't get nearly as much metadata
[22:25] This puts everything in a folder, but the naming is pretty crap, so I wrote a Python script to rename the files to the year, month, and day
[22:26] After that I run packjpg to compress everything to about 75% and then pack it into .tar.bz2 archives
[22:26] Well, we usually archive in the WARC format if possible.
[22:26] I'm not too familiar with WARC, so some changes would probably be necessary there
[22:27] What vsco-dl does should be fairly easy to do with a plugin for wget-lua or wpull.
[22:27] Yeah, the problem is that I'm averaging 220 MB/user right now
[22:27] My current list is 150,000 names and growing, so it's already in the 30 TB range, which is more space than I have
[22:27] Any idea how large it is in total?
[22:27] Ah
[22:28] The thing is, VSCO reported 30 million active monthly users last year
[22:28] So it's probably in the petabytes range at least
[22:28] *** jschwart has quit IRC (Konversation terminated!)
[22:28] Hmm, that seems way too large for a photo sharing website.
[22:29] Vidme and SoundCloud are in that range.
[22:29] (Well, Vidme was and SC is.)
[22:29] Exactly, Vidme *was*
[22:29] and SC was threatening to go under
[22:29] Yeah
[22:29] hence my concern
[22:30] what happened to SC, anyway? did they find new funding?
[22:30] Right, but I can't believe that VSCO gets even close to 1 PB.
[22:31] I'm not sure what IA thinks about grabbing a copy of them though.
[22:31] I mean, the 30 million thing is pretty widely reported https://finance.yahoo.com/news/vsco-now-30-million-active-170002551.html
[22:31] That's actually the only info I can find about their stats. No user info since then, no size info, no quarterly reports.
[22:32] I'm not even really sure how they make money; there are no articles about it on the first pages of search.
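The size figures being traded above check out as back-of-envelope arithmetic (assuming decimal units, and assuming the 220 MB/user average from the scraped sample holds across the whole user base, which it may not):

```python
# Back-of-envelope check of the figures quoted above (decimal units assumed).
MB, TB, PB = 10**6, 10**12, 10**15

per_user = 220 * MB            # observed average per scraped user
names_so_far = 150_000         # current name list
monthly_active = 30_000_000    # VSCO's reported figure

list_size_tb = names_so_far * per_user / TB
total_size_pb = monthly_active * per_user / PB

print(list_size_tb)   # 33.0 -> "already in the 30 TB range"
print(total_size_pb)  # 6.6  -> "probably in the petabytes range"
```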
[22:32] But yeah, there's the issue of privacy and all that. I remember the Instagram project got a lot of bad press
[22:32] IA may not want that
[22:35] Anyways, I thought I'd float the idea to archiveteam, see if anyone was interested
[22:35] *** purplebot has joined #archiveteam-bs
[22:36] *** Rai-chan has joined #archiveteam-bs
[22:36] Looks like you can purchase something called "VSCO Film"?
[22:36] There's no immediate danger, but I remember how short notice on Vidme meant we couldn't save all of it
[22:36] Hard to imagine how one product could bring in enough cash to host as much data as they do
[22:37] Who knows, though; they don't seem to post earnings or anything
[22:37] Yeah, it's nice to have an idea of how the site works etc. already so we can grab it quickly when they announce the shutdown.
[22:37] astrid, yeah, thanks -- it was http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz$
[22:37] minus the $ http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz
[22:37] *** i0npulse has joined #archiveteam-bs
[22:37] right
[22:37] astrid: the album is widely available in 16/44.1 (including several times on archive.org); in 24/96 (as on the BD) it's much rarer. I just sourced one after 10 years or so, and it cost me £50
[22:38] despite that, it's still clearly marked as CC-BY-SA-NC
[22:38] that was darked in December 2014 with the comment "possible rights issues"
[22:38] email info@archive.org and maybe they'll un-dark it
[22:38] thanks, I shall. Can you tell if I was the original uploader? I've completely forgotten. My username is jmtd on archive.org
[22:38] thanks for all your help
[22:38] k_o: Apparently you can also buy filters and possibly other stuff through an in-app store. The famous microtransactions scheme.
[22:39] original uploader was someone with email address 893productions@gmail.com
[22:39] ok yeah that wasn't me. Thanks :>
[22:39] In that case, their business model may be sound
[22:39] I'll still email
[22:39] sure thing Jon
[22:39] I figured it's a website worth keeping an eye on though
[22:39] k_o: Sure. Are you willing to share your code for scraping users?
[22:39] * Jon goes to bed
[22:40] Sure, it's written in Python and uses Selenium
[22:40] I can put it up on pastebin
[22:41] It's probably not the most efficient way to go about it, but I don't know how else to render their crappy website except with a headless browser
[22:42] Yeah, it should be a lot faster to just do the relevant API requests directly.
[22:43] I'm interested in seeing the code anyway, also because I wanted to look into headless browsers for archiving before.
[22:45] *** k_o_ has joined #archiveteam-bs
[22:45] internet crashed
[22:45] idk if the message got through; I'll upload the code to pastebin
[22:45] *** k_o has quit IRC (Ping timeout: 260 seconds)
[22:46] i get to have fun setting up my new comcast cable modem later
[22:46] k_o_: Here's what happened: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2018-01-09,Tue&sel=229#l225
[22:47] Alright, that's all the messages I sent
[22:47] Gimme a sec to cut out the code and put it up
[22:50] https://pastebin.com/au6eSN39
[22:51] You start it off by creating a file vsco.txt with
[22:51] at least one username and a "|" before the first username
[22:52] It searches the collection for each user and adds those names to the file, going through all of the new names, so theoretically it will eventually scrape every non-orphan user on the site
[22:52] If you need to stop the script, just move the | back to the point you want, and it won't search through the first names again
[22:52] It also checks for duplicates and won't add those, so each username is unique
[22:53] Ah, collections, I see.
[22:53] Thanks
[22:54] My vsco.txt is slightly over 157,000 lines currently, but with 30 million active users, that's barely half a percent
[22:55] It's been running for about a day, so given a few weeks, it could probably build up a pretty good list
[22:55] I figured it would be helpful to have around if/when there's a shutdown notice
[22:56] Indeed
[23:07] Another idea to discover users would be to search for tags appearing on the individual photo pages.
[23:10] I think most of the people who are tagged also appear in the collections, but I could be wrong
[23:11] If the script I'm running finishes with a lot of users missing, I could try that
[23:29] *** BlueMaxim has joined #archiveteam-bs
[23:57] *** wbradley has quit IRC (WeeChat 1.4)
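The tag-based discovery idea floated above is essentially a breadth-first walk over a user/photo-tag graph. A minimal abstract sketch, under stated assumptions: `photos_of` and `users_tagged_in` are hypothetical stand-ins for whatever site lookups would actually enumerate a user's photos and the users tagged in a photo; only the traversal itself is shown.

```python
# Abstract sketch of user discovery via photo tags: a breadth-first walk.
# `photos_of` and `users_tagged_in` are hypothetical lookup callbacks.
from collections import deque

def discover_via_tags(seeds, photos_of, users_tagged_in):
    """Return every user reachable from `seeds` through photo tags."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        for photo in photos_of(user):
            for tagged in users_tagged_in(photo):
                if tagged not in seen:
                    seen.add(tagged)
                    queue.append(tagged)
    return seen
```

Like the collection-based crawl, this only ever finds non-orphan users reachable from the seed list, so the two approaches would complement rather than replace each other.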