#archiveteam-bs 2018-01-09,Tue

Time	Nickname	Message
00:03 ^🔗		ranav has quit IRC (Read error: Connection reset by peer)
00:14 ^🔗		ranavalon has joined #archiveteam-bs
00:14 ^🔗		ranavalon has quit IRC (Remote host closed the connection)
00:15 ^🔗		ranavalon has joined #archiveteam-bs
00:18 ^🔗		BlueMaxim has quit IRC (Leaving)
01:00 ^🔗		ranavalon has quit IRC (Quit: Leaving)
01:15 ^🔗		BlueMaxim has joined #archiveteam-bs
01:42 ^🔗		yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
01:42 ^🔗		yuitimoth has joined #archiveteam-bs
01:54 ^🔗		yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
01:54 ^🔗		yuitimoth has joined #archiveteam-bs
02:12 ^🔗		DFJustin has quit IRC (Remote host closed the connection)
02:15 ^🔗		DFJustin has joined #archiveteam-bs
02:15 ^🔗		swebb sets mode: +o DFJustin
02:38 ^🔗	bithippo	Is it possible to change which collection an item belongs to after creating said item?
02:43 ^🔗		bithippo has quit IRC (Ping timeout: 260 seconds)
03:29 ^🔗		atlogbot has quit IRC (Read error: Operation timed out)
03:29 ^🔗		swebb has quit IRC (Read error: Operation timed out)
03:30 ^🔗		swebb has joined #archiveteam-bs
03:30 ^🔗		atlogbot has joined #archiveteam-bs
03:30 ^🔗		svchfoo3 sets mode: +o swebb
03:30 ^🔗		svchost03 sets mode: +v atlogbot
04:46 ^🔗		jdude104 has quit IRC (Read error: Operation timed out)
04:49 ^🔗		qw3rty14 has joined #archiveteam-bs
04:53 ^🔗		qw3rty13 has quit IRC (Read error: Operation timed out)
05:05 ^🔗		K4k has quit IRC (Read error: Connection reset by peer)
05:42 ^🔗		w0rp has quit IRC (Ping timeout: 245 seconds)
05:45 ^🔗		w0rp has joined #archiveteam-bs
06:28 ^🔗		zyphlar has joined #archiveteam-bs
07:04 ^🔗		sekolyn has joined #archiveteam-bs
07:05 ^🔗		octothorp has quit IRC (Read error: Operation timed out)
07:28 ^🔗		octothorp has joined #archiveteam-bs
07:29 ^🔗		sekolyn has quit IRC (Read error: Operation timed out)
07:29 ^🔗		kpz has joined #archiveteam-bs
07:30 ^🔗		kpz has left
07:44 ^🔗		Asparagir has joined #archiveteam-bs
08:38 ^🔗		zyphlar has quit IRC (Quit: Connection closed for inactivity)
08:46 ^🔗		Asparagir has quit IRC (Asparagir)
09:49 ^🔗		BlueMaxim has quit IRC (Read error: Connection reset by peer)
11:07 ^🔗		slyphic has quit IRC (Read error: Operation timed out)
11:13 ^🔗		slyphic has joined #archiveteam-bs
11:42 ^🔗		ZexaronS- has quit IRC (Quit: Leaving)
12:31 ^🔗		altlabel has joined #archiveteam-bs
12:35 ^🔗	JAA	Anyone else getting a lot of errors when accessing the Wayback Machine? I get "unable to connect", timeouts, pages which never finish loading, etc.
12:46 ^🔗		sep332_ has joined #archiveteam-bs
12:47 ^🔗		sep332 has quit IRC (Read error: Operation timed out)
13:30 ^🔗		jacketcha has quit IRC (Read error: Connection reset by peer)
13:48 ^🔗	JAA	Seems to be better now.
15:20 ^🔗		Mateon1 has quit IRC (Ping timeout: 255 seconds)
15:20 ^🔗		Mateon1 has joined #archiveteam-bs
15:33 ^🔗		jdude104 has joined #archiveteam-bs
15:46 ^🔗		jdude104 has quit IRC (Quit: Leaving)
16:37 ^🔗		schbirid has joined #archiveteam-bs
17:33 ^🔗		RichardG_ has joined #archiveteam-bs
17:33 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
17:47 ^🔗		RichardG_ has quit IRC (Read error: Connection reset by peer)
17:50 ^🔗		jschwart has joined #archiveteam-bs
17:53 ^🔗		RichardG has joined #archiveteam-bs
17:54 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
17:57 ^🔗		RichardG has joined #archiveteam-bs
18:13 ^🔗	SketchCow	I'm cleaning WARCs still
18:13 ^🔗	SketchCow	https://archive.org/details/archiveteam_miiverse is getting that massive miiverse grab
18:14 ^🔗	SketchCow	https://archive.org/details/warczone now exists. It is "outsider" WARCs, WARCs where we have no idea who is sending them. There's a good chance they won't go directly into Wayback.
18:15 ^🔗		ReimuHaku has quit IRC (Ping timeout: 250 seconds)
18:17 ^🔗		ReimuHaku has joined #archiveteam-bs
18:17 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
18:17 ^🔗		RichardG has joined #archiveteam-bs
18:22 ^🔗		K4k has joined #archiveteam-bs
18:23 ^🔗	jrwr	SketchCow: that's a damn fine pun you made there
18:46 ^🔗	SketchCow	https://archive.org/details/archiveteam_yahoogroups is about to get super huge
18:48 ^🔗		adinbied has joined #archiveteam-bs
18:50 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
18:50 ^🔗	adinbied	Hi all, I seem to have lost the link to the Discord server, can anyone send it to me? A while back I asked about archiving Gazelle-based sites, and got linked to the discord server to talk to -Archivist-, as he/she was working on that at the time. Thanks!
18:51 ^🔗		RichardG has joined #archiveteam-bs
18:53 ^🔗	JAA	adinbied: If it was posted in here, try searching the logs: http://archive.fart.website/bin/irclogger_logs
18:55 ^🔗	adinbied	Found it, thanks!
18:59 ^🔗		adinbied has quit IRC (Quit: Page closed)
19:35 ^🔗		ndiddy_ has quit IRC ()
19:38 ^🔗		ndiddy has joined #archiveteam-bs
20:01 ^🔗		REiN^ has quit IRC (no.money.no.love)
20:02 ^🔗		purplebot has quit IRC (Ping timeout: 248 seconds)
20:02 ^🔗	PurpleSym	SketchCow: Can I get permission to upload to that collection?
20:02 ^🔗		HCross2 has quit IRC (Ping timeout: 248 seconds)
20:03 ^🔗		Rai-chan has quit IRC (Ping timeout: 248 seconds)
20:03 ^🔗		i0npulse has quit IRC (Ping timeout: 248 seconds)
20:09 ^🔗		RichardG has quit IRC (Ping timeout: 248 seconds)
20:10 ^🔗		RichardG has joined #archiveteam-bs
20:17 ^🔗		AeonG_ has joined #archiveteam-bs
20:22 ^🔗		Caz has quit IRC (Read error: Operation timed out)
20:30 ^🔗		purplebot has joined #archiveteam-bs
20:30 ^🔗		i0npulse has joined #archiveteam-bs
20:31 ^🔗		HCross2 has joined #archiveteam-bs
20:31 ^🔗		svchfoo1 sets mode: +o HCross2
20:33 ^🔗		Rai-chan has joined #archiveteam-bs
20:33 ^🔗		odemg has quit IRC (Read error: Operation timed out)
20:36 ^🔗		odemg has joined #archiveteam-bs
20:41 ^🔗	SketchCow	I don't see why not, you're one of the processes I've got cleaning up
20:43 ^🔗	SketchCow	You now have archiveteam_yahoogroups. You might need to log out of your browser to get it noticed.
20:50 ^🔗	DrasticAc	SketchCow: Thanks for moving those miiverse files
20:52 ^🔗	DrasticAc	Kinda realized part of the way through making them that I _probably_ should have made fewer of them, rather than 10,000 post chunks.
20:52 ^🔗	DrasticAc	But, hey, it's easier for people to download a 200 MB warc than multiple terabytes if they just need one post ;)
20:55 ^🔗		octothorp has quit IRC (Read error: Connection reset by peer)
20:59 ^🔗		Rai-chan has quit IRC (Ping timeout: 248 seconds)
21:01 ^🔗		HCross2 has quit IRC (Ping timeout: 248 seconds)
21:01 ^🔗		purplebot has quit IRC (Ping timeout: 248 seconds)
21:02 ^🔗	godane	SketchCow: my cat threw up on one of your boxes
21:03 ^🔗	godane	I NEED TO GET LABELS NOW SO I CAN MAIL THEM BEFORE THE CAT RUINS YOUR STUFF
21:06 ^🔗		i0npulse has quit IRC (Ping timeout: 248 seconds)
21:06 ^🔗	godane	tapes are fine but the box has dried cat vomit on it
21:08 ^🔗		SketchCow changes topic to: Lengthy Archive Team related discussions here | General archiving & offtopic: #archiveteam-ot | < godane> SketchCow: my cat threw up on one of your boxes
21:08 ^🔗	SketchCow	Let me get on that
21:08 ^🔗	SketchCow	DrasticAc: yes, if I'd had more of a say on your project, I'd have said you should have 50 GB per item
21:09 ^🔗	DrasticAc	Yeah, it was one of those things I didn't know until it was too late to switch.
21:10 ^🔗	DrasticAc	But next time, I have a better idea of what to do.
21:10 ^🔗	Igloo	godane: i am glad I am not the only one with that problem. My cats puke on stuff all the time ¬_¬
21:11 ^🔗	DrasticAc	I don't know if it'll be useful, but I was thinking of making a mini-archivebot for stuff like Slack or Discord.
21:12 ^🔗	SketchCow	https://archive.org/details/archiveteam_verizon
21:12 ^🔗	SketchCow	You can see my script slowly adding a filler logo to all the items
21:12 ^🔗	DrasticAc	Since it seems like a portion of stuff that gets submitted to archivebot are one-off sites (like twitter links), having something like that available more widely might be useful.
21:13 ^🔗	godane	Igloo: luckily most of my stuff is in my room
21:13 ^🔗	DrasticAc	Although I guess you can use the IA extension for that.
21:13 ^🔗	SketchCow	The problem is that people are not very good at assessing archivebot
21:13 ^🔗	godane	and the cat doesn't come into my room
21:13 ^🔗	godane	but there is no room for boxes in my room
21:13 ^🔗	SketchCow	And we get people doing things like "hurr durr The Onion is pretty amazeballs, I better kick off a million-url job with one line because just in case"
21:14 ^🔗	SketchCow	"Hey, someone mirrored a mirror of a mirror we mirror, better get THAT copy too"
21:14 ^🔗	Igloo	We are trying to police that much better though....
21:14 ^🔗	SketchCow	We are
21:14 ^🔗	SketchCow	Adding it to random discords or slacks would not be smart
21:14 ^🔗	DrasticAc	Could keep a database to check against that though.
21:14 ^🔗	SketchCow	I'd kill any link
21:14 ^🔗	DrasticAc	Like, if x link was already archived, don't do it again.
21:14 ^🔗	SketchCow	Drop to a whitelist of people who can kick off jobs
21:15 ^🔗	Igloo	DrasticAc we do that. But it's just a bit broken at the moment. If you want to help us fix it we'd appreciate it ;-)
21:15 ^🔗	Igloo	AB is a victim of its own success.
21:16 ^🔗	SketchCow	Just saying. Don't make more links to archivebot
21:16 ^🔗	godane	in other news i got my archivebox rpi project to broadcast a 'honeypot' wifi
21:16 ^🔗	SketchCow	Or things that can kick off archivebot to an even larger set of feel-no-pain instigators
21:17 ^🔗	DrasticAc	Oh no, I'm not saying make a slack bot that talks to _our_ archivebot.
21:17 ^🔗	godane	next part of my project is to add a local wayback machine to it
21:17 ^🔗	DrasticAc	I'm saying "make something totally different that offers a limited set of its functions"
21:17 ^🔗	SketchCow	Oh, here's a project I was thinking about that someone should do.
21:17 ^🔗	SketchCow	Ready?
21:17 ^🔗	SketchCow	You seem to all be quite capable of this.
21:17 ^🔗	SketchCow	A little package, that if you drop it in a directory, and the directory has WARCs, you get a little mini wayback for it
21:18 ^🔗	SketchCow	With maybe a navigatron option for the family of URLs it covers
21:19 ^🔗	Igloo	So, something that can run on any server and provide a Wayback feel for the WARCs in that directory?
21:19 ^🔗	SketchCow	Yes.
21:19 ^🔗	SketchCow	Or a subdirectory, I guess
21:19 ^🔗	SketchCow	WARCS/
21:19 ^🔗	Igloo	Interesting, I like the idea of that
21:20 ^🔗	DrasticAc	Yeah, that sounds very useful
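
The indexing half of the "drop it in a directory of WARCs" idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing Archive Team tool: the function names (`open_warc`, `iter_records`, `build_index`) are hypothetical, the parser is a deliberately minimal standard-library reader that trusts each record's Content-Length, and replay/serving is left out entirely.

```python
import gzip
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def open_warc(path: Path):
    """Open a .warc or .warc.gz file as a binary stream (hypothetical helper)."""
    if path.name.endswith(".gz"):
        return gzip.open(path, "rb")  # also reads multi-member gzip files
    return path.open("rb")


def iter_records(stream) -> Iterator[Dict[str, str]]:
    """Yield the WARC headers of each record, skipping over the payload."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers: Dict[str, str] = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break  # an empty line ends the header section
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Assumption: Content-Length is correct. The block is read and
        # discarded, which is simple but wasteful for very large records.
        stream.read(int(headers.get("content-length", "0")))
        yield headers


def build_index(directory: Path) -> Dict[str, List[Tuple[str, str]]]:
    """Map each captured URL to a sorted list of (WARC-Date, file name)."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for path in sorted(directory.rglob("*.warc*")):
        if not path.is_file() or not path.name.endswith((".warc", ".warc.gz")):
            continue
        with open_warc(path) as stream:
            for headers in iter_records(stream):
                if headers.get("warc-type") != "response":
                    continue
                url = headers.get("warc-target-uri", "").strip("<>")
                if url:
                    index[url].append((headers.get("warc-date", ""), path.name))
    for captures in index.values():
        captures.sort()
    return dict(index)
```

Calling `build_index(Path("WARCS"))` would give a URL-to-captures table that a small web front end could use for the navigation part. A real tool would also need record offsets for replay, which this sketch does not record.
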
21:20 ^🔗	SketchCow	Do it
21:20 ^🔗	SketchCow	waiting
21:20 ^🔗	*	SketchCow taps watch
21:21 ^🔗	DrasticAc	Just wait till I get off of work, have dinner, etc.
21:24 ^🔗	SketchCow	https://www.youtube.com/watch?v=af3mlZ28MzI
21:24 ^🔗	Igloo	<< I love that film >>
21:36 ^🔗		purplebot has joined #archiveteam-bs
21:43 ^🔗		purplebot has quit IRC (hub.dk irc.underworld.no)
22:12 ^🔗		k_o has joined #archiveteam-bs
22:14 ^🔗		Jon has joined #archiveteam-bs
22:16 ^🔗	Jon	hmm. I've got a blu ray, CC-BY-SA-NC, but it is DRM protected. I would like to put it on archive.org but not sure whether to put it up with or without the DRM. Also a prior upload by someone else years back got deleted without explanation
22:17 ^🔗	astrid	do you have a link to this prior upload? it was probably darked because the copyright holder complained. i can check though.
22:22 ^🔗		octothorp has joined #archiveteam-bs
22:22 ^🔗	JAA	k_o: VSCO will be quite annoying to archive with all that JS going on. If you could write up a summary of what the site structure is like and how the content can be accessed, that would be great.
22:23 ^🔗	JAA	Looks like they don't use numeric IDs though, so iterating over everything won't be easy.
22:23 ^🔗	k_o	Oh, the site is one of the worst things I've ever seen.
22:23 ^🔗	k_o	I've got two scripts that can download it, though.
22:24 ^🔗	JAA	That's definitely also helpful, yes.
22:24 ^🔗	k_o	The one I prefer is from github and it's written in ruby
22:24 ^🔗	k_o	Lemme find the link
22:24 ^🔗	JAA	(Ugh, Ruby. ;-) )
22:24 ^🔗	k_o	https://github.com/HuggableSquare/vsco-dl Well, the other one I wrote in Python, but it's a good deal slower than this one, and doesn't get nearly as much metadata
22:25 ^🔗	k_o	This puts everything in a folder, but the naming is pretty crap, so I wrote a Python script to rename the files to the year, month, and day
22:26 ^🔗	k_o	After that I run packjpg to compress everything to about 75% and then pack it into .tar.bz2 archives
22:26 ^🔗	JAA	Well, we usually archive in the WARC format if possible.
22:26 ^🔗	k_o	I'm not too familiar with WARC, so some changes would probably be necessary there
22:27 ^🔗	JAA	What vsco-dl does should be fairly easy to do with a plugin for wget-lua or wpull.
22:27 ^🔗	k_o	Yeah, the problem is that I'm averaging 220MB/user right now
22:27 ^🔗	k_o	My current list is 150,000 names and growing, so it's already in the 30 TB range, which is more space than I have
22:27 ^🔗	JAA	Any idea how large it is in total?
22:27 ^🔗	JAA	Ah
22:28 ^🔗	k_o	The thing is, VSCO reported 30 million active monthly users last year
22:28 ^🔗	k_o	So it's probably in the petabytes range at least
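
The size estimate above is plain multiplication, and can be checked with a few lines. This is a back-of-envelope sketch only: it assumes decimal units (1 TB = 10^6 MB, 1 PB = 10^9 MB) and, more importantly, that the 220 MB average observed on the scraped list would hold for every reported active user, which is not established in the discussion.

```python
# Figures quoted in the conversation.
MB_PER_USER = 220          # average size per user observed so far
SCRAPED_USERS = 150_000    # usernames on the current list
ACTIVE_USERS = 30_000_000  # monthly active users reported by VSCO


def total_mb(users: int, mb_per_user: int = MB_PER_USER) -> int:
    """Total size in MB for a given number of users."""
    return users * mb_per_user


# Assumption: decimal units, 1 TB = 10**6 MB and 1 PB = 10**9 MB.
list_tb = total_mb(SCRAPED_USERS) / 10**6
site_pb = total_mb(ACTIVE_USERS) / 10**9

print(f"current list: {list_tb:.1f} TB")
print(f"all active users at the same average: {site_pb:.1f} PB")
```

With these inputs the current list comes to about 33 TB and the extrapolation to about 6.6 PB, which matches the "30 TB range" and "petabytes range" figures in the messages.
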
22:28 ^🔗		jschwart has quit IRC (Konversation terminated!)
22:28 ^🔗	JAA	Hmm, that seems way too large for a photo sharing website.
22:29 ^🔗	JAA	Vidme and SoundCloud are in that range.
22:29 ^🔗	JAA	(Well, Vidme was and SC is.)
22:29 ^🔗	k_o	Exactly, vidme was
22:29 ^🔗	k_o	and SC was threatening to go under
22:29 ^🔗	JAA	Yeah
22:29 ^🔗	k_o	hence my concern
22:30 ^🔗	k_o	what happened to SC, anyway? did they find new funding?
22:30 ^🔗	JAA	Right, but I can't believe that VSCO gets even close to 1 PB.
22:31 ^🔗	JAA	I'm not sure what IA thinks about grabbing a copy of them though.
22:31 ^🔗	k_o	I mean the 30 million thing is pretty widely reported https://finance.yahoo.com/news/vsco-now-30-million-active-170002551.html
22:31 ^🔗	k_o	That's actually the only info I can find about their stats. No user info since then, no size info, no quarterly reports.
22:32 ^🔗	k_o	I'm not even really sure how they make money, there are no articles about it on the first pages of search.
22:32 ^🔗	k_o	But yeah, there's the issue of privacy and all that. I remember the Instagram project got a lot of bad press
22:32 ^🔗	k_o	IA may not want that
22:35 ^🔗	k_o	Anyways, I thought I'd float the idea to archiveteam, see if anyone was interested
22:35 ^🔗		purplebot has joined #archiveteam-bs
22:36 ^🔗		Rai-chan has joined #archiveteam-bs
22:36 ^🔗	JAA	Looks like you can purchase something called "VSCO Film"?
22:36 ^🔗	k_o	There's no immediate danger, but I remember how short notice on vidme meant we couldn't save all of it
22:36 ^🔗	k_o	Hard to imagine how one product could bring in enough cash to host as much data as they do
22:37 ^🔗	k_o	Who knows, though, they don't seem to post earnings or anything
22:37 ^🔗	JAA	Yeah, it's nice to have an idea of how the site works etc. already so we can grab it quickly when they announce the shutdown.
22:37 ^🔗	Jon	astrid, yeah, thanks -- it was http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz$
22:37 ^🔗	Jon	minus the $ http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz
22:37 ^🔗		i0npulse has joined #archiveteam-bs
22:37 ^🔗	astrid	right
22:37 ^🔗	Jon	astrid: the album is widely available in 16/44.1 (including several times on archive.org); in 24/96 (as on the BD) it's much rarer. I just sourced one after 10 years or so, and it cost me £50
22:38 ^🔗	Jon	despite that it's still clearly marked as CC-BY-SA-NC
22:38 ^🔗	astrid	that was darked in december 2014 with the comment "possible rights issues"
22:38 ^🔗	astrid	email info@archive.org and maybe they'll un-dark it
22:38 ^🔗	Jon	thanks, I shall. Can you tell if I was the original uploader? I've completely forgotten. My username is jmtd on archive.org
22:38 ^🔗	Jon	thanks for all your help
22:38 ^🔗	JAA	k_o: Apparently you can also buy filters and possibly other stuff through an in-app store. The famous microtransactions scheme.
22:39 ^🔗	astrid	original uploader was someone with email address 893productions@gmail.com
22:39 ^🔗	Jon	ok yeah that wasn't me. Thanks :>
22:39 ^🔗	k_o	In that case, their business model may be sound
22:39 ^🔗	Jon	I'll still email
22:39 ^🔗	astrid	sure thing Jon
22:39 ^🔗	k_o	I figured it's a website worth keeping an eye on though
22:39 ^🔗	JAA	k_o: Sure. Are you willing to share your code for scraping users?
22:39 ^🔗	*	Jon goes to bed
22:40 ^🔗	k_o	Sure, it's written in python and uses selenium
22:40 ^🔗	k_o	I can put it up on pastebin
22:41 ^🔗	k_o	It's probably not the most efficient way to go about it, but I don't know how else to render their crappy website except for a headless browser
22:42 ^🔗	JAA	Yeah, it should be a lot faster to just do the relevant API requests directly.
22:43 ^🔗	JAA	I'm interested in seeing the code anyway, also because I wanted to look into headless browsers for archiving before.
22:45 ^🔗		k_o_ has joined #archiveteam-bs
22:45 ^🔗	k_o_	internet crashed
22:45 ^🔗	k_o_	idk if the message got through, I'll upload the code to pastebin
22:45 ^🔗		k_o has quit IRC (Ping timeout: 260 seconds)
22:46 ^🔗	godane	i get to have fun setting up my new comcast cable modem later
22:46 ^🔗	JAA	k_o_: Here's what happened: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2018-01-09,Tue&sel=229#l225
22:47 ^🔗	k_o_	Alright, that's all the messages I sent
22:47 ^🔗	k_o_	Gimme a sec to cut out the code and put it up
22:50 ^🔗	k_o_	https://pastebin.com/au6eSN39
22:51 ^🔗	k_o_	You start it off by creating a file vsco.txt with
22:51 ^🔗	k_o_	At least one username and a "|" before the first username
22:51 ^🔗	k_o_	It searches the collection for each user and adds those names to the file, going through all of the new names, so theoretically it will eventually scrape every non-orphan user on the site
22:52 ^🔗	k_o_	If you need to break the script, just move the | back to the point you want it, and it won't search through the first names again
22:52 ^🔗	k_o_	It also checks for duplicates and won't add those, so each username is unique
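
The discovery loop described above (a list of usernames, a "|" marking where to resume, and a duplicate check) could look roughly like this. It is not the pastebin script: the exact file layout (one username per line, "|" prefixed to the next name to process) is an assumption, and `fetch_collection_users` is a hypothetical stand-in for the Selenium step that reads a user's collection.

```python
from typing import Callable, Iterable, List, Optional, Tuple

MARKER = "|"  # assumed to prefix the next username to be processed


def parse_frontier(text: str) -> Tuple[List[str], int]:
    """Return (usernames, index of the next username to process)."""
    names: List[str] = []
    cursor = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(MARKER):
            cursor = len(names)
            line = line[len(MARKER):].strip()
            if not line:
                continue  # marker alone on the last line: nothing left to do
        names.append(line)
    return names, cursor


def format_frontier(names: List[str], cursor: int) -> str:
    """Serialise the list back to text with the marker at `cursor`."""
    lines = [MARKER + n if i == cursor else n for i, n in enumerate(names)]
    if cursor >= len(names):
        lines.append(MARKER)
    return "\n".join(lines) + "\n"


def crawl(
    names: List[str],
    cursor: int,
    fetch_collection_users: Callable[[str], Iterable[str]],
    max_steps: Optional[int] = None,
) -> Tuple[List[str], int]:
    """Process users from `cursor` onward, appending unseen usernames."""
    names = list(names)
    seen = set(names)  # duplicate check, so each username appears once
    steps = 0
    while cursor < len(names) and (max_steps is None or steps < max_steps):
        for found in fetch_collection_users(names[cursor]):
            if found not in seen:
                seen.add(found)
                names.append(found)
        cursor += 1
        steps += 1
    return names, cursor


# Tiny made-up graph standing in for the real site.
fake_site = {"alice": ["bob", "carol"], "bob": ["alice", "dave"], "dave": ["carol"]}
names, cursor = parse_frontier("|alice\n")
names, cursor = crawl(names, cursor, lambda user: fake_site.get(user, []))
print(format_frontier(names, cursor), end="")
```

Because new names are appended to the end and the marker only moves forward, stopping and restarting resumes where the marker sits, which is the behaviour described in the messages. Users who never appear in anyone's collection are never reached.
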
22:53 ^🔗	JAA	Ah, collections, I see.
22:53 ^🔗	JAA	Thanks
22:54 ^🔗	k_o_	My vsco.txt is slightly over 157,000 lines currently, but with 30 million active users, that's barely half a percent
22:55 ^🔗	k_o_	It's been running for about a day, so given a few weeks, it could probably build up a pretty good list
22:55 ^🔗	k_o_	I figured it would be helpful to have around if/when there's a shutdown notice
22:56 ^🔗	JAA	Indeed
23:07 ^🔗	JAA	Another idea to discover users would be to search for tags appearing on the individual photo pages.
23:10 ^🔗	k_o_	I think most of the people who are tagged also appear on the collection, but I could be wrong
23:11 ^🔗	k_o_	If the script I'm running finishes with a lot of users missing, I could try that
23:29 ^🔗		BlueMaxim has joined #archiveteam-bs
23:57 ^🔗		wbradley has quit IRC (WeeChat 1.4)
