[00:03] *** ranav has quit IRC (Read error: Connection reset by peer)
[00:14] *** ranavalon has joined #archiveteam-bs
[00:14] *** ranavalon has quit IRC (Remote host closed the connection)
[00:15] *** ranavalon has joined #archiveteam-bs
[00:18] *** BlueMaxim has quit IRC (Leaving)
[01:00] *** ranavalon has quit IRC (Quit: Leaving)
[01:15] *** BlueMaxim has joined #archiveteam-bs
[01:42] *** yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
[01:42] *** yuitimoth has joined #archiveteam-bs
[01:54] *** yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
[01:54] *** yuitimoth has joined #archiveteam-bs
[02:12] *** DFJustin has quit IRC (Remote host closed the connection)
[02:15] *** DFJustin has joined #archiveteam-bs
[02:15] *** swebb sets mode: +o DFJustin
[02:38] <bithippo> Is it possible to edit an item's collection it belongs to after creating said item?
[02:43] *** bithippo has quit IRC (Ping timeout: 260 seconds)
[03:29] *** atlogbot has quit IRC (Read error: Operation timed out)
[03:29] *** swebb has quit IRC (Read error: Operation timed out)
[03:30] *** swebb has joined #archiveteam-bs
[03:30] *** atlogbot has joined #archiveteam-bs
[03:30] *** svchfoo3 sets mode: +o swebb
[03:30] *** svchost03 sets mode: +v atlogbot
[04:46] *** jdude104 has quit IRC (Read error: Operation timed out)
[04:49] *** qw3rty14 has joined #archiveteam-bs
[04:53] *** qw3rty13 has quit IRC (Read error: Operation timed out)
[05:05] *** K4k has quit IRC (Read error: Connection reset by peer)
[05:42] *** w0rp has quit IRC (Ping timeout: 245 seconds)
[05:45] *** w0rp has joined #archiveteam-bs
[06:28] *** zyphlar has joined #archiveteam-bs
[07:04] *** sekolyn has joined #archiveteam-bs
[07:05] *** octothorp has quit IRC (Read error: Operation timed out)
[07:28] *** octothorp has joined #archiveteam-bs
[07:29] *** sekolyn has quit IRC (Read error: Operation timed out)
[07:29] *** kpz has joined #archiveteam-bs
[07:30] *** kpz has left 
[07:44] *** Asparagir has joined #archiveteam-bs
[08:38] *** zyphlar has quit IRC (Quit: Connection closed for inactivity)
[08:46] *** Asparagir has quit IRC (Asparagir)
[09:49] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[11:07] *** slyphic has quit IRC (Read error: Operation timed out)
[11:13] *** slyphic has joined #archiveteam-bs
[11:42] *** ZexaronS- has quit IRC (Quit: Leaving)
[12:31] *** altlabel has joined #archiveteam-bs
[12:35] <JAA> Anyone else getting a lot of errors when accessing the Wayback Machine? I get "unable to connect", timeouts, pages which never finish loading, etc.
[12:46] *** sep332_ has joined #archiveteam-bs
[12:47] *** sep332 has quit IRC (Read error: Operation timed out)
[13:30] *** jacketcha has quit IRC (Read error: Connection reset by peer)
[13:48] <JAA> Seems to be better now.
[15:20] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[15:20] *** Mateon1 has joined #archiveteam-bs
[15:33] *** jdude104 has joined #archiveteam-bs
[15:46] *** jdude104 has quit IRC (Quit: Leaving)
[16:37] *** schbirid has joined #archiveteam-bs
[17:33] *** RichardG_ has joined #archiveteam-bs
[17:33] *** RichardG has quit IRC (Read error: Connection reset by peer)
[17:47] *** RichardG_ has quit IRC (Read error: Connection reset by peer)
[17:50] *** jschwart has joined #archiveteam-bs
[17:53] *** RichardG has joined #archiveteam-bs
[17:54] *** RichardG has quit IRC (Read error: Connection reset by peer)
[17:57] *** RichardG has joined #archiveteam-bs
[18:13] <SketchCow> I'm cleaning WARCs still
[18:13] <SketchCow> https://archive.org/details/archiveteam_miiverse is getting that massive miiverse grab
[18:14] <SketchCow> https://archive.org/details/warczone now exists. It is "outsider" WARCs, WARCs where we have no idea who is sending them. There's a good chance they won't go directly into Wayback.
[18:15] *** ReimuHaku has quit IRC (Ping timeout: 250 seconds)
[18:17] *** ReimuHaku has joined #archiveteam-bs
[18:17] *** RichardG has quit IRC (Read error: Connection reset by peer)
[18:17] *** RichardG has joined #archiveteam-bs
[18:22] *** K4k has joined #archiveteam-bs
[18:23] <jrwr> SketchCow: thats a damn fine pun you made there
[18:46] <SketchCow> https://archive.org/details/archiveteam_yahoogroups is about to get super huge
[18:48] *** adinbied has joined #archiveteam-bs
[18:50] *** RichardG has quit IRC (Read error: Connection reset by peer)
[18:50] <adinbied> Hi all, I seem to have lost the link to the Discord server, can anyone send it to me? A while back I asked about archiving Gazelle-based sites, and got linked to the discord server to talk to -Archivist-, as he/she was working on that at the time. Thanks!
[18:51] *** RichardG has joined #archiveteam-bs
[18:53] <JAA> adinbied: If it was posted in here, try searching the logs: http://archive.fart.website/bin/irclogger_logs
[18:55] <adinbied> Found it, thanks!
[18:59] *** adinbied has quit IRC (Quit: Page closed)
[19:35] *** ndiddy_ has quit IRC ()
[19:38] *** ndiddy has joined #archiveteam-bs
[20:01] *** REiN^ has quit IRC (no.money.no.love)
[20:02] *** purplebot has quit IRC (Ping timeout: 248 seconds)
[20:02] <PurpleSym> SketchCow: Can I get permission to upload to that collection?
[20:02] *** HCross2 has quit IRC (Ping timeout: 248 seconds)
[20:03] *** Rai-chan has quit IRC (Ping timeout: 248 seconds)
[20:03] *** i0npulse has quit IRC (Ping timeout: 248 seconds)
[20:09] *** RichardG has quit IRC (Ping timeout: 248 seconds)
[20:10] *** RichardG has joined #archiveteam-bs
[20:17] *** AeonG_ has joined #archiveteam-bs
[20:22] *** Caz has quit IRC (Read error: Operation timed out)
[20:30] *** purplebot has joined #archiveteam-bs
[20:30] *** i0npulse has joined #archiveteam-bs
[20:31] *** HCross2 has joined #archiveteam-bs
[20:31] *** svchfoo1 sets mode: +o HCross2
[20:33] *** Rai-chan has joined #archiveteam-bs
[20:33] *** odemg has quit IRC (Read error: Operation timed out)
[20:36] *** odemg has joined #archiveteam-bs
[20:41] <SketchCow> I don't see why not, you're one of the processes I've got cleaning up
[20:43] <SketchCow> You now have archiveteam_yahoogroups. You might need to log out of your browser to get it noticed.
[20:50] <DrasticAc> SketchCow: Thanks for moving those miiverse files
[20:52] <DrasticAc> Kinda realized part of the way through making them that I _probably_ should have less of them, rather than 10,000 post chunks.
[20:52] <DrasticAc> But, hey, it's easier for people to download a 200 MB warc than multiple terabytes if they just need one post ;)
[20:55] *** octothorp has quit IRC (Read error: Connection reset by peer)
[20:59] *** Rai-chan has quit IRC (Ping timeout: 248 seconds)
[21:01] *** HCross2 has quit IRC (Ping timeout: 248 seconds)
[21:01] *** purplebot has quit IRC (Ping timeout: 248 seconds)
[21:02] <godane> SketchCow: my cat throw up on one of your boxes
[21:03] <godane> I NEED TO GET LABELS NOW SO I CAN MAIL THEM BEFORE THE CAT RUINS YOUR STUFF
[21:06] *** i0npulse has quit IRC (Ping timeout: 248 seconds)
[21:06] <godane> tapes are fine but box has dry cat vomit on it
[21:08] *** SketchCow changes topic to: Lengthy Archive Team related discussions here | General archiving & offtopic: #archiveteam-ot | < godane> SketchCow: my cat throw up on one of your boxes       
[21:08] <SketchCow> Let me get on that
[21:08] <SketchCow> DrasticAc: yes, if I'd had more of a say on your project, I'd have said you should have 50gb per item
[21:09] <DrasticAc> Yeah, it was one of those things I didn't know until it was too late to switch.
[21:10] <DrasticAc> But next time, I have a better idea of what to do.
[21:10] <Igloo> godane: i am glad I am not the only one with that problem. My cats puke on stuff all the time ¬_¬
[21:11] <DrasticAc> I don't know if it'll be useful, but I was thinking of making a mini-archivebot for stuff like Slack or Discord.
[21:12] <SketchCow> https://archive.org/details/archiveteam_verizon
[21:12] <SketchCow> You can see my script slowly adding a filler logo to all the items
[21:12] <DrasticAc> Since it seems like a portion of stuff that gets submitted to archivebot are one-off sites (like twitter links), having something like that available more widely might be useful.
[21:13] <godane> lgloo: lucky for most of my stuff is in my room
[21:13] <DrasticAc> Although I guess you can use the IA extension for that.
[21:13] <SketchCow> The problem is that people are not very good at assessing archivebot
[21:13] <godane> and the cat doesn't come into my room
[21:13] <godane> but there is no room for boxes in my room
[21:13] <SketchCow> And we get people doing things like "hurr durr The Onion is pretty amazeballs, I better kick off a million-url job with one line because just in case"
[21:14] <SketchCow> "Hey, someone mirrored a mirror of a mirror we mirror, better get THAT copy too"
[21:14] <Igloo> We are trying to police that much better though....
[21:14] <SketchCow> We are
[21:14] <SketchCow> Adding it to random discords or slacks would not be smark
[21:14] <DrasticAc> Could keep a database to check against that though.
[21:14] <SketchCow> I'd kill any link
[21:14] <DrasticAc> Like, if x link was already archived, don't do it again.
[21:14] <SketchCow> Drop to a whitelist of people who can kick off jobs
[21:15] <Igloo> DrasticAc we do that. But it's just a bit broken at the moment. If you want to help us fix it we'd appreciate it ;-)
[21:15] <Igloo> AB is a victim of it's own success.
[21:16] <SketchCow> Just saying. Don't make more links to archivebot
[21:16] <godane> in other news i got my archivebox rpi project to broadcast a 'honeypot' wifi
[21:16] <SketchCow> Or things that can kick off archivebot to an even larger set of feel-no-pain instigators
[21:17] <DrasticAc> Oh no, I'm not saying make a slack bot that talks to _our_ archivebot.
[21:17] <godane> next part of my project is to add a local wayback machine to it
[21:17] <DrasticAc> I'm saying "make something totally different that offers a limited set of its functions"
[21:17] <SketchCow> Oh, here's a project I was thinking about that someone should do.
[21:17] <SketchCow> Ready?
[21:17] <SketchCow> You seem to all be quite capable of this.
[21:17] <SketchCow> A little package, that if you drop it in a directory, and the directory has WARCs, you get a little mini wayback for it
[21:18] <SketchCow> Which maybe a navigatron option for the family of URLs it covers
[21:19] <Igloo> So, Something that can run on any server? and provide a way back feel for the warcs in that directory?
[21:19] <SketchCow> Yes.
[21:19] <SketchCow> Or a subdirectory, I guess
[21:19] <SketchCow> WARCS/
[21:19] <Igloo> Interesting, I like the idea of that
[21:20] <DrasticAc> Yeah, that sounds very useful
[21:20] <SketchCow> Do it
[21:20] <SketchCow> waiting
[21:20] * SketchCow taps watch
[21:21] <DrasticAc> Just wait till I get off of work, have dinner, etc.
[21:24] <SketchCow> https://www.youtube.com/watch?v=af3mlZ28MzI
[21:24] <Igloo> << I love that film >> 
[21:36] *** purplebot has joined #archiveteam-bs
[21:43] *** purplebot has quit IRC (hub.dk irc.underworld.no)
[22:12] *** k_o has joined #archiveteam-bs
[22:14] *** Jon has joined #archiveteam-bs
[22:16] <Jon> hmm. I've got a blu ray, CC-BY-SA-NC, but it is DRM protected. I would like to put it on archive.org but not sure whether to put it up with or without the DRM. Also a prior upload by someone else years back got deleted without explanation
[22:17] <astrid> do you have a link to this prior upload? it was probably darked because the copyright holder complained. i can check though.
[22:22] *** octothorp has joined #archiveteam-bs
[22:22] <JAA> k_o: VSCO will be quite annoying to archive with all that JS going on. If you could write up a summary of what the site structure is like and how the content can be accessed, that would be great.
[22:23] <JAA> Looks like they don't use numeric IDs though, so iterating over everything won't be easy.
[22:23] <k_o> Oh, the site is one of the worst things I've ever seen.
[22:23] <k_o> I've got two scripts that can download it, though.
[22:24] <JAA> That's definitely also helpful, yes.
[22:24] <k_o> The one I prefer is from github and it's written in ruby
[22:24] <k_o> Lemme find the link
[22:24] <JAA> (Ugh, Ruby. ;-) )
[22:24] <k_o> https://github.com/HuggableSquare/vsco-dl   Well, the other one I wrote in Python, but it's a good deal slower than this one, and doesn't get nearly as much metadata
[22:25] <k_o> This puts everything in a folder, but the naming is pretty crap, so I wrote a Python script to rename the files to the year, month, and day
[22:26] <k_o> After that I run packjpg to compress everything to about 75% and then pack it into .tar.bz2 archives
[22:26] <JAA> Well, we usually archive in the WARC format if possible.
[22:26] <k_o> I'm not too familiar with WARC, so some changes would probably be necessary there
[22:27] <JAA> What vsco-dl does should be fairly easy to do with a plugin for wget-lua or wpull.
[22:27] <k_o> Yeah, the problem is that I'm averaging 220MB/user right now
[22:27] <k_o> My current list is 150,000 names and growing, so it's already in the 30 TB range, which is more space than I have
[22:27] <JAA> Any idea how large it is in total?
[22:27] <JAA> Ah
[22:28] <k_o> The thing is, VSCO reported 30 million active monthly users last year
[22:28] <k_o> So it's probably in the petabytes range at least
[22:28] *** jschwart has quit IRC (Konversation terminated!)
[22:28] <JAA> Hmm, that seems way too large for a photo sharing website.
[22:29] <JAA> Vidme and SoundCloud are in that range.
[22:29] <JAA> (Well, Vidme was and SC is.)
[22:29] <k_o> Exactly, vidme *was*
[22:29] <k_o> and SC was threatening to go under
[22:29] <JAA> Yeah
[22:29] <k_o> hence my concern
[22:30] <k_o> what happened to SC, anyway? did they find new funding?
[22:30] <JAA> Right, but I can't believe that VSCO gets even close to 1 PB.
[22:31] <JAA> I'm not sure what IA thinks about grabbing a copy of them though.
[22:31] <k_o> I mean the 30 million thing is pretty widely reported https://finance.yahoo.com/news/vsco-now-30-million-active-170002551.html
[22:31] <k_o> That's actually the only info I can find about their stats. No user info since then, no size info, no quarterly reports.
[22:32] <k_o> I'm not even really sure how they make money, there's no articles about it on the first pages of search.
[22:32] <k_o> But yeah, there's the issue of privacy and all that. I remember the Instagram project got a lot of bad press
[22:32] <k_o> IA may not want that
[22:35] <k_o> Anyways, I thought I'd float the idea to archiveteam, see if anyone was interested
[22:35] *** purplebot has joined #archiveteam-bs
[22:36] *** Rai-chan has joined #archiveteam-bs
[22:36] <JAA> Looks like you can purchase something called "VSCO Film"?
[22:36] <k_o> There's no immediate danger, but I remember how short notice on vidme meant we couldn't save all of it
[22:36] <k_o> Hard to imagine how one product could bring in enough cash to host as much data as they do
[22:37] <k_o> Who knows, though, they don't seem to post earnings or anything
[22:37] <JAA> Yeah, it's nice to have an idea of how the site works etc. already so we can grab it quickly when they announce the shutdown.
[22:37] <Jon> astrid, yeah, thanks -- it was http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz$
[22:37] <Jon> minus the $ http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz
[22:37] *** i0npulse has joined #archiveteam-bs
[22:37] <astrid> right
[22:37] <Jon> astrid: the album is widely available in 16/44.1 (including several times on archive.org); in 24/96 (as on the BD) it's much rarer. I just sourced one after 10 years or so, and it cost me £50
[22:38] <Jon> despite that it's still clearly marked as CC-BY-SA-NC
[22:38] <astrid> that was darked in december 2014 with the comment "possible rights issues"
[22:38] <astrid> email info@archive.org and maybe they'll un-dark it
[22:38] <Jon> thanks, I shall. Can you tell if I was the original uploader? I've completely forgotten. My username is jmtd on archive.org
[22:38] <Jon> thanks for all your help
[22:38] <JAA> k_o: Apparently you can also buy filters and possibly other stuff through an in-app store. The famous microtransactions scheme.
[22:39] <astrid> original uploader was someone with email address 893productions@gmail.com
[22:39] <Jon> ok yeah that wasn't me. Thanks :>
[22:39] <k_o> In that case, their business model may be sound
[22:39] <Jon> I'll still email
[22:39] <astrid> sure thing Jon
[22:39] <k_o> I figured it's a website worth keeping an eye on though
[22:39] <JAA> k_o: Sure. Are you willing to share your code for scraping users?
[22:39] * Jon goes to bed
[22:40] <k_o> Sure, it's written in python and uses selenium
[22:40] <k_o> I can put it up on pastebin
[22:41] <k_o> It's probably not the most efficient way to go about it, but I don't know how else to render their crappy website except for a headless browser
[22:42] <JAA> Yeah, it should be a lot faster to just do the relevant API requests directly.
[22:43] <JAA> I'm interested in seeing the code anyway, also because I wanted to look into headless browsers for archiving before.
[22:45] *** k_o_ has joined #archiveteam-bs
[22:45] <k_o_> internet crashed
[22:45] <k_o_> idk if the message got through, I'll upload the code to pastebin
[22:45] *** k_o has quit IRC (Ping timeout: 260 seconds)
[22:46] <godane> i get to have fun setting up my new comcast cable modem latter
[22:46] <JAA> k_o_: Here's what happened: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2018-01-09,Tue&sel=229#l225
[22:47] <k_o_> Alright, that's all the messages I sent
[22:47] <k_o_> Gimme a sec to cut out the code and put it up
[22:50] <k_o_> https://pastebin.com/au6eSN39
[22:51] <k_o_> You start if off by creating a file vsco.txt with
[22:51] <k_o_> At least one username and a "|" before the first username
[22:51] <k_o_> It searches the collection for each user and adds those names to the file, going through all of the new names, so theoretically it will eventually scrape every non-orphan user on the site
[22:52] <k_o_> If you need to break the script, just move the | back to the point you want it, and it won't search through the first names again
[22:52] <k_o_> It also checks for duplicates and won't add those, so each username is unique
[22:53] <JAA> Ah, collections, I see.
[22:53] <JAA> Thanks
[22:54] <k_o_> My vsco.txt is slightly over 157,000 lines currently, but with 30 million active users, that's barely half a percent
[22:55] <k_o_> It's been running for about a day, so given a few weeks, it could probably build up a pretty good list
[22:55] <k_o_> I figured it would be helpful to have around if/when there's a shutdown notice
[22:56] <JAA> Indeed
[23:07] <JAA> Another idea to discover users would be to search for tags appearing on the individual photo pages.
[23:10] <k_o_> I think most of the people who are tagged also appear on the collection, but I could be wrong
[23:11] <k_o_> If the script I'm running finishes with a lot of users missing, I could try that
[23:29] *** BlueMaxim has joined #archiveteam-bs
[23:57] *** wbradley has quit IRC (WeeChat 1.4)