#archiveteam-bs 2018-01-09,Tue


Time Nickname Message
00:03 🔗 ranav has quit IRC (Read error: Connection reset by peer)
00:14 🔗 ranavalon has joined #archiveteam-bs
00:14 🔗 ranavalon has quit IRC (Remote host closed the connection)
00:15 🔗 ranavalon has joined #archiveteam-bs
00:18 🔗 BlueMaxim has quit IRC (Leaving)
01:00 🔗 ranavalon has quit IRC (Quit: Leaving)
01:15 🔗 BlueMaxim has joined #archiveteam-bs
01:42 🔗 yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
01:42 🔗 yuitimoth has joined #archiveteam-bs
01:54 🔗 yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
01:54 🔗 yuitimoth has joined #archiveteam-bs
02:12 🔗 DFJustin has quit IRC (Remote host closed the connection)
02:15 🔗 DFJustin has joined #archiveteam-bs
02:15 🔗 swebb sets mode: +o DFJustin
02:38 🔗 bithippo Is it possible to change which collection an item belongs to after creating said item?
02:43 🔗 bithippo has quit IRC (Ping timeout: 260 seconds)
03:29 🔗 atlogbot has quit IRC (Read error: Operation timed out)
03:29 🔗 swebb has quit IRC (Read error: Operation timed out)
03:30 🔗 swebb has joined #archiveteam-bs
03:30 🔗 atlogbot has joined #archiveteam-bs
03:30 🔗 svchfoo3 sets mode: +o swebb
03:30 🔗 svchost03 sets mode: +v atlogbot
04:46 🔗 jdude104 has quit IRC (Read error: Operation timed out)
04:49 🔗 qw3rty14 has joined #archiveteam-bs
04:53 🔗 qw3rty13 has quit IRC (Read error: Operation timed out)
05:05 🔗 K4k has quit IRC (Read error: Connection reset by peer)
05:42 🔗 w0rp has quit IRC (Ping timeout: 245 seconds)
05:45 🔗 w0rp has joined #archiveteam-bs
06:28 🔗 zyphlar has joined #archiveteam-bs
07:04 🔗 sekolyn has joined #archiveteam-bs
07:05 🔗 octothorp has quit IRC (Read error: Operation timed out)
07:28 🔗 octothorp has joined #archiveteam-bs
07:29 🔗 sekolyn has quit IRC (Read error: Operation timed out)
07:29 🔗 kpz has joined #archiveteam-bs
07:30 🔗 kpz has left
07:44 🔗 Asparagir has joined #archiveteam-bs
08:38 🔗 zyphlar has quit IRC (Quit: Connection closed for inactivity)
08:46 🔗 Asparagir has quit IRC (Asparagir)
09:49 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
11:07 🔗 slyphic has quit IRC (Read error: Operation timed out)
11:13 🔗 slyphic has joined #archiveteam-bs
11:42 🔗 ZexaronS- has quit IRC (Quit: Leaving)
12:31 🔗 altlabel has joined #archiveteam-bs
12:35 🔗 JAA Anyone else getting a lot of errors when accessing the Wayback Machine? I get "unable to connect", timeouts, pages which never finish loading, etc.
12:46 🔗 sep332_ has joined #archiveteam-bs
12:47 🔗 sep332 has quit IRC (Read error: Operation timed out)
13:30 🔗 jacketcha has quit IRC (Read error: Connection reset by peer)
13:48 🔗 JAA Seems to be better now.
15:20 🔗 Mateon1 has quit IRC (Ping timeout: 255 seconds)
15:20 🔗 Mateon1 has joined #archiveteam-bs
15:33 🔗 jdude104 has joined #archiveteam-bs
15:46 🔗 jdude104 has quit IRC (Quit: Leaving)
16:37 🔗 schbirid has joined #archiveteam-bs
17:33 🔗 RichardG_ has joined #archiveteam-bs
17:33 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
17:47 🔗 RichardG_ has quit IRC (Read error: Connection reset by peer)
17:50 🔗 jschwart has joined #archiveteam-bs
17:53 🔗 RichardG has joined #archiveteam-bs
17:54 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
17:57 🔗 RichardG has joined #archiveteam-bs
18:13 🔗 SketchCow I'm cleaning WARCs still
18:13 🔗 SketchCow https://archive.org/details/archiveteam_miiverse is getting that massive miiverse grab
18:14 🔗 SketchCow https://archive.org/details/warczone now exists. It is "outsider" WARCs, WARCs where we have no idea who is sending them. There's a good chance they won't go directly into Wayback.
18:15 🔗 ReimuHaku has quit IRC (Ping timeout: 250 seconds)
18:17 🔗 ReimuHaku has joined #archiveteam-bs
18:17 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
18:17 🔗 RichardG has joined #archiveteam-bs
18:22 🔗 K4k has joined #archiveteam-bs
18:23 🔗 jrwr SketchCow: thats a damn fine pun you made there
18:46 🔗 SketchCow https://archive.org/details/archiveteam_yahoogroups is about to get super huge
18:48 🔗 adinbied has joined #archiveteam-bs
18:50 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
18:50 🔗 adinbied Hi all, I seem to have lost the link to the Discord server, can anyone send it to me? A while back I asked about archiving Gazelle-based sites, and got linked to the discord server to talk to -Archivist-, as he/she was working on that at the time. Thanks!
18:51 🔗 RichardG has joined #archiveteam-bs
18:53 🔗 JAA adinbied: If it was posted in here, try searching the logs: http://archive.fart.website/bin/irclogger_logs
18:55 🔗 adinbied Found it, thanks!
18:59 🔗 adinbied has quit IRC (Quit: Page closed)
19:35 🔗 ndiddy_ has quit IRC ()
19:38 🔗 ndiddy has joined #archiveteam-bs
20:01 🔗 REiN^ has quit IRC (no.money.no.love)
20:02 🔗 purplebot has quit IRC (Ping timeout: 248 seconds)
20:02 🔗 PurpleSym SketchCow: Can I get permission to upload to that collection?
20:02 🔗 HCross2 has quit IRC (Ping timeout: 248 seconds)
20:03 🔗 Rai-chan has quit IRC (Ping timeout: 248 seconds)
20:03 🔗 i0npulse has quit IRC (Ping timeout: 248 seconds)
20:09 🔗 RichardG has quit IRC (Ping timeout: 248 seconds)
20:10 🔗 RichardG has joined #archiveteam-bs
20:17 🔗 AeonG_ has joined #archiveteam-bs
20:22 🔗 Caz has quit IRC (Read error: Operation timed out)
20:30 🔗 purplebot has joined #archiveteam-bs
20:30 🔗 i0npulse has joined #archiveteam-bs
20:31 🔗 HCross2 has joined #archiveteam-bs
20:31 🔗 svchfoo1 sets mode: +o HCross2
20:33 🔗 Rai-chan has joined #archiveteam-bs
20:33 🔗 odemg has quit IRC (Read error: Operation timed out)
20:36 🔗 odemg has joined #archiveteam-bs
20:41 🔗 SketchCow I don't see why not, you're one of the processes I've got cleaning up
20:43 🔗 SketchCow You now have archiveteam_yahoogroups. You might need to log out of your browser to get it noticed.
20:50 🔗 DrasticAc SketchCow: Thanks for moving those miiverse files
20:52 🔗 DrasticAc Kinda realized part of the way through making them that I _probably_ should have less of them, rather than 10,000 post chunks.
20:52 🔗 DrasticAc But, hey, it's easier for people to download a 200 MB warc than multiple terabytes if they just need one post ;)
20:55 🔗 octothorp has quit IRC (Read error: Connection reset by peer)
20:59 🔗 Rai-chan has quit IRC (Ping timeout: 248 seconds)
21:01 🔗 HCross2 has quit IRC (Ping timeout: 248 seconds)
21:01 🔗 purplebot has quit IRC (Ping timeout: 248 seconds)
21:02 🔗 godane SketchCow: my cat threw up on one of your boxes
21:03 🔗 godane I NEED TO GET LABELS NOW SO I CAN MAIL THEM BEFORE THE CAT RUINS YOUR STUFF
21:06 🔗 i0npulse has quit IRC (Ping timeout: 248 seconds)
21:06 🔗 godane tapes are fine but box has dry cat vomit on it
21:08 🔗 SketchCow changes topic to: Lengthy Archive Team related discussions here | General archiving & offtopic: #archiveteam-ot | < godane> SketchCow: my cat threw up on one of your boxes
21:08 🔗 SketchCow Let me get on that
21:08 🔗 SketchCow DrasticAc: yes, if I'd had more of a say on your project, I'd have said you should have 50gb per item
21:09 🔗 DrasticAc Yeah, it was one of those things I didn't know until it was too late to switch.
21:10 🔗 DrasticAc But next time, I have a better idea of what to do.
21:10 🔗 Igloo godane: i am glad I am not the only one with that problem. My cats puke on stuff all the time ¬_¬
21:11 🔗 DrasticAc I don't know if it'll be useful, but I was thinking of making a mini-archivebot for stuff like Slack or Discord.
21:12 🔗 SketchCow https://archive.org/details/archiveteam_verizon
21:12 🔗 SketchCow You can see my script slowly adding a filler logo to all the items
21:12 🔗 DrasticAc Since it seems like a portion of stuff that gets submitted to archivebot are one-off sites (like twitter links), having something like that available more widely might be useful.
21:13 🔗 godane Igloo: luckily most of my stuff is in my room
21:13 🔗 DrasticAc Although I guess you can use the IA extension for that.
21:13 🔗 SketchCow The problem is that people are not very good at assessing archivebot
21:13 🔗 godane and the cat doesn't come into my room
21:13 🔗 godane but there is no room for boxes in my room
21:13 🔗 SketchCow And we get people doing things like "hurr durr The Onion is pretty amazeballs, I better kick off a million-url job with one line because just in case"
21:14 🔗 SketchCow "Hey, someone mirrored a mirror of a mirror we mirror, better get THAT copy too"
21:14 🔗 Igloo We are trying to police that much better though....
21:14 🔗 SketchCow We are
21:14 🔗 SketchCow Adding it to random discords or slacks would not be smart
21:14 🔗 DrasticAc Could keep a database to check against that though.
21:14 🔗 SketchCow I'd kill any link
21:14 🔗 DrasticAc Like, if x link was already archived, don't do it again.
21:14 🔗 SketchCow Drop to a whitelist of people who can kick off jobs
21:15 🔗 Igloo DrasticAc we do that. But it's just a bit broken at the moment. If you want to help us fix it we'd appreciate it ;-)
21:15 🔗 Igloo AB is a victim of its own success.
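
The "database to check against" idea DrasticAc describes could be sketched roughly as follows. This is only an illustration under assumptions: it is not how ArchiveBot's own duplicate check works (the log does not say), the table name `archived` and both function names are made up for the example, and no URL normalization is attempted, so `http://` and `https://` variants would count as different links.

```python
import sqlite3


def open_seen_db(path=":memory:"):
    """Open (or create) a tiny database of URLs that were already submitted."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS archived (url TEXT PRIMARY KEY)")
    return db


def should_archive(db, url):
    """Record `url`; return True if it is new, False if it was seen before."""
    # INSERT OR IGNORE leaves the table untouched when the primary key exists,
    # so the cursor's rowcount is 1 for a new URL and 0 for a repeat.
    cur = db.execute("INSERT OR IGNORE INTO archived (url) VALUES (?)", (url,))
    db.commit()
    return cur.rowcount == 1


db = open_seen_db()
for link in ["https://example.com/a", "https://example.com/b", "https://example.com/a"]:
    print(link, "->", "archive" if should_archive(db, link) else "skip")
```

A real bot would probably also want an expiry, so that a link can be re-archived after some time has passed.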
21:16 🔗 SketchCow Just saying. Don't make more links to archivebot
21:16 🔗 godane in other news i got my archivebox rpi project to broadcast a 'honeypot' wifi
21:16 🔗 SketchCow Or things that can kick off archivebot to an even larger set of feel-no-pain instigators
21:17 🔗 DrasticAc Oh no, I'm not saying make a slack bot that talks to _our_ archivebot.
21:17 🔗 godane next part of my project is to add a local wayback machine to it
21:17 🔗 DrasticAc I'm saying "make something totally different that offers a limited set of its functions"
21:17 🔗 SketchCow Oh, here's a project I was thinking about that someone should do.
21:17 🔗 SketchCow Ready?
21:17 🔗 SketchCow You seem to all be quite capable of this.
21:17 🔗 SketchCow A little package, that if you drop it in a directory, and the directory has WARCs, you get a little mini wayback for it
21:18 🔗 SketchCow With maybe a navigatron option for the family of URLs it covers
21:19 🔗 Igloo So, something that can run on any server? And provide a Wayback feel for the WARCs in that directory?
21:19 🔗 SketchCow Yes.
21:19 🔗 SketchCow Or a subdirectory, I guess
21:19 🔗 SketchCow WARCS/
21:19 🔗 Igloo Interesting, I like the idea of that
21:20 🔗 DrasticAc Yeah, that sounds very useful
21:20 🔗 SketchCow Do it
21:20 🔗 SketchCow waiting
21:20 🔗 * SketchCow taps watch
21:21 🔗 DrasticAc Just wait till I get off of work, have dinner, etc.
21:24 🔗 SketchCow https://www.youtube.com/watch?v=af3mlZ28MzI
21:24 🔗 Igloo << I love that film >>
21:36 🔗 purplebot has joined #archiveteam-bs
21:43 🔗 purplebot has quit IRC (hub.dk irc.underworld.no)
22:12 🔗 k_o has joined #archiveteam-bs
22:14 🔗 Jon has joined #archiveteam-bs
22:16 🔗 Jon hmm. I've got a blu ray, CC-BY-SA-NC, but it is DRM protected. I would like to put it on archive.org but not sure whether to put it up with or without the DRM. Also a prior upload by someone else years back got deleted without explanation
22:17 🔗 astrid do you have a link to this prior upload? it was probably darked because the copyright holder complained. i can check though.
22:22 🔗 octothorp has joined #archiveteam-bs
22:22 🔗 JAA k_o: VSCO will be quite annoying to archive with all that JS going on. If you could write up a summary of what the site structure is like and how the content can be accessed, that would be great.
22:23 🔗 JAA Looks like they don't use numeric IDs though, so iterating over everything won't be easy.
22:23 🔗 k_o Oh, the site is one of the worst things I've ever seen.
22:23 🔗 k_o I've got two scripts that can download it, though.
22:24 🔗 JAA That's definitely also helpful, yes.
22:24 🔗 k_o The one I prefer is from github and it's written in ruby
22:24 🔗 k_o Lemme find the link
22:24 🔗 JAA (Ugh, Ruby. ;-) )
22:24 🔗 k_o https://github.com/HuggableSquare/vsco-dl Well, the other one I wrote in Python, but it's a good deal slower than this one, and doesn't get nearly as much metadata
22:25 🔗 k_o This puts everything in a folder, but the naming is pretty crap, so I wrote a Python script to rename the files to the year, month, and day
22:26 🔗 k_o After that I run packjpg to compress everything to about 75% and then pack it into .tar.bz2 archives
22:26 🔗 JAA Well, we usually archive in the WARC format if possible.
22:26 🔗 k_o I'm not too familiar with WARC, so some changes would probably be necessary there
22:27 🔗 JAA What vsco-dl does should be fairly easy to do with a plugin for wget-lua or wpull.
22:27 🔗 k_o Yeah, the problem is that I'm averaging 220MB/user right now
22:27 🔗 k_o My current list is 150,000 names and growing, so it's already in the 30 TB range, which is more space than I have
22:27 🔗 JAA Any idea how large it is in total?
22:27 🔗 JAA Ah
22:28 🔗 k_o The thing is, VSCO reported 30 million active monthly users last year
22:28 🔗 k_o So it's probably in the petabytes range at least
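
The arithmetic behind k_o's estimate can be checked quickly. The sketch below uses only the figures quoted in the chat (220 MB per user, a 150,000-name list, 30 million reported monthly users) and assumes, as k_o implicitly does, that every user averages the same size as the sampled ones; that assumption is exactly what JAA questions next.

```python
# Figures quoted in the chat; the per-user average is k_o's own measurement.
MB_PER_USER = 220
USERS_IN_LIST = 150_000
REPORTED_MONTHLY_USERS = 30_000_000


def total_terabytes(users, mb_per_user=MB_PER_USER):
    """Estimated size in decimal terabytes (1 TB = 1,000,000 MB)."""
    return users * mb_per_user / 1_000_000


list_tb = total_terabytes(USERS_IN_LIST)
site_pb = total_terabytes(REPORTED_MONTHLY_USERS) / 1000

print(f"current list: about {list_tb:.0f} TB")
print(f"all reported users: about {site_pb:.1f} PB")
```

If the sampled users are heavier than average (for example because accounts found through collections post more), the whole-site figure would be an overestimate.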
22:28 🔗 jschwart has quit IRC (Konversation terminated!)
22:28 🔗 JAA Hmm, that seems way too large for a photo sharing website.
22:29 🔗 JAA Vidme and SoundCloud are in that range.
22:29 🔗 JAA (Well, Vidme was and SC is.)
22:29 🔗 k_o Exactly, vidme *was*
22:29 🔗 k_o and SC was threatening to go under
22:29 🔗 JAA Yeah
22:29 🔗 k_o hence my concern
22:30 🔗 k_o what happened to SC, anyway? did they find new funding?
22:30 🔗 JAA Right, but I can't believe that VSCO gets even close to 1 PB.
22:31 🔗 JAA I'm not sure what IA thinks about grabbing a copy of them though.
22:31 🔗 k_o I mean the 30 million thing is pretty widely reported https://finance.yahoo.com/news/vsco-now-30-million-active-170002551.html
22:31 🔗 k_o That's actually the only info I can find about their stats. No user info since then, no size info, no quarterly reports.
22:32 🔗 k_o I'm not even really sure how they make money, there are no articles about it on the first pages of search.
22:32 🔗 k_o But yeah, there's the issue of privacy and all that. I remember the Instagram project got a lot of bad press
22:32 🔗 k_o IA may not want that
22:35 🔗 k_o Anyways, I thought I'd float the idea to archiveteam, see if anyone was interested
22:35 🔗 purplebot has joined #archiveteam-bs
22:36 🔗 Rai-chan has joined #archiveteam-bs
22:36 🔗 JAA Looks like you can purchase something called "VSCO Film"?
22:36 🔗 k_o There's no immediate danger, but I remember how short notice on vidme meant we couldn't save all of it
22:36 🔗 k_o Hard to imagine how one product could bring in enough cash to host as much data as they do
22:37 🔗 k_o Who knows, though, they don't seem to post earnings or anything
22:37 🔗 JAA Yeah, it's nice to have an idea of how the site works etc. already so we can grab it quickly when they announce the shutdown.
22:37 🔗 Jon astrid, yeah, thanks -- it was http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz$
22:37 🔗 Jon minus the $ http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz
22:37 🔗 i0npulse has joined #archiveteam-bs
22:37 🔗 astrid right
22:37 🔗 Jon astrid: the album is widely available in 16/44.1 (including several times on archive.org); in 24/96 (as on the BD) it's much rarer. I just sourced one after 10 years or so, and it cost me £50
22:38 🔗 Jon despite that it's still clearly marked as CC-BY-SA-NC
22:38 🔗 astrid that was darked in december 2014 with the comment "possible rights issues"
22:38 🔗 astrid email info@archive.org and maybe they'll un-dark it
22:38 🔗 Jon thanks, I shall. Can you tell if I was the original uploader? I've completely forgotten. My username is jmtd on archive.org
22:38 🔗 Jon thanks for all your help
22:38 🔗 JAA k_o: Apparently you can also buy filters and possibly other stuff through an in-app store. The famous microtransactions scheme.
22:39 🔗 astrid original uploader was someone with email address 893productions@gmail.com
22:39 🔗 Jon ok yeah that wasn't me. Thanks :>
22:39 🔗 k_o In that case, their business model may be sound
22:39 🔗 Jon I'll still email
22:39 🔗 astrid sure thing Jon
22:39 🔗 k_o I figured it's a website worth keeping an eye on though
22:39 🔗 JAA k_o: Sure. Are you willing to share your code for scraping users?
22:39 🔗 * Jon goes to bed
22:40 🔗 k_o Sure, it's written in python and uses selenium
22:40 🔗 k_o I can put it up on pastebin
22:41 🔗 k_o It's probably not the most efficient way to go about it, but I don't know how else to render their crappy website except for a headless browser
22:42 🔗 JAA Yeah, it should be a lot faster to just do the relevant API requests directly.
22:43 🔗 JAA I'm interested in seeing the code anyway, also because I wanted to look into headless browsers for archiving before.
22:45 🔗 k_o_ has joined #archiveteam-bs
22:45 🔗 k_o_ internet crashed
22:45 🔗 k_o_ idk if the message got through, I'll upload the code to pastebin
22:45 🔗 k_o has quit IRC (Ping timeout: 260 seconds)
22:46 🔗 godane i get to have fun setting up my new comcast cable modem later
22:46 🔗 JAA k_o_: Here's what happened: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2018-01-09,Tue&sel=229#l225
22:47 🔗 k_o_ Alright, that's all the messages I sent
22:47 🔗 k_o_ Gimme a sec to cut out the code and put it up
22:50 🔗 k_o_ https://pastebin.com/au6eSN39
22:51 🔗 k_o_ You start it off by creating a file vsco.txt with
22:51 🔗 k_o_ At least one username and a "|" before the first username
22:51 🔗 k_o_ It searches the collection for each user and adds those names to the file, going through all of the new names, so theoretically it will eventually scrape every non-orphan user on the site
22:52 🔗 k_o_ If you need to break the script, just move the | back to the point you want it, and it won't search through the first names again
22:52 🔗 k_o_ It also checks for duplicates and won't add those, so each username is unique
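
The discovery loop k_o_ describes (a growing list of names, a marker for how far the crawl has got, and a duplicate check) can be sketched as below. This is not the pastebin script: `crawl_usernames` and `fetch_collection_users` are hypothetical names, the fetch step is a stand-in for whatever the Selenium code does, and an integer `cursor` takes the place of the "|" marker in vsco.txt, whose exact on-disk layout is not reproduced here.

```python
def crawl_usernames(seed_names, fetch_collection_users, cursor=0):
    """Breadth-first username discovery over a growing list.

    `fetch_collection_users(name)` is a hypothetical callable returning the
    usernames that appear in `name`'s collection. `cursor` marks the first
    name that has not been searched yet, like the "|" in the chat.
    """
    names = list(dict.fromkeys(seed_names))  # keep order, drop duplicates
    seen = set(names)
    while cursor < len(names):
        for found in fetch_collection_users(names[cursor]):
            if found not in seen:  # each username is stored only once
                seen.add(found)
                names.append(found)
        cursor += 1
    return names, cursor


# Toy stand-in for the site: who shows up in whose collection.
toy_collections = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "dave": ["bob"],
}
names, cursor = crawl_usernames(
    ["alice"], lambda user: toy_collections.get(user, [])
)
print(names, cursor)
```

Passing a saved list and a later `cursor` resumes the crawl without searching the earlier names again, which matches the "move the | back to the point you want it" advice. As noted in the chat, users who never appear in any reachable collection are not found this way.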
22:53 🔗 JAA Ah, collections, I see.
22:53 🔗 JAA Thanks
22:54 🔗 k_o_ My vsco.txt is slightly over 157,000 lines currently, but with 30 million active users, that's barely half a percent
22:55 🔗 k_o_ It's been running for about a day, so given a few weeks, it could probably build up a pretty good list
22:55 🔗 k_o_ I figured it would be helpful to have around if/when there's a shutdown notice
22:56 🔗 JAA Indeed
23:07 🔗 JAA Another idea to discover users would be to search for tags appearing on the individual photo pages.
23:10 🔗 k_o_ I think most of the people who are tagged also appear on the collection, but I could be wrong
23:11 🔗 k_o_ If the script I'm running finishes with a lot of users missing, I could try that
23:29 🔗 BlueMaxim has joined #archiveteam-bs
23:57 🔗 wbradley has quit IRC (WeeChat 1.4)
