#archiveteam-bs 2016-07-31,Sun

Time	Nickname	Message
00:01 ^🔗	hook54321	Would grab-site run on a Raspberry Pi? :P
00:01 ^🔗	MrRadar	Again, I don't see why not. Though it might not be super-fast
00:02 ^🔗	MrRadar	I switched to running wpull on an old Core 2 Duo laptop after about a week since every time I started it it would take 15+ seconds to actually start doing work
00:02 ^🔗	MrRadar	On the Pi
00:03 ^🔗	MrRadar	And it would noticeably pause to parse large web pages
00:03 ^🔗	MrRadar	Or even not so large pages
00:03 ^🔗	MrRadar	Even an old Atom like that is probably multiple times faster than the 1st Raspberry Pi CPU
00:03 ^🔗	MrRadar	Which is approximately equivalent to a Pentium 2
00:04 ^🔗	hook54321	The computer with the Atom processor only has a 160 GB hard drive, I'm pretty sure that will become an issue.
00:04 ^🔗	MrRadar	It depends on how big the site is that you're trying to scrape
00:05 ^🔗	MrRadar	I think grab-site instructs wpull to split the WARC files at a certain size
00:05 ^🔗		whydomain has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
00:05 ^🔗	MrRadar	So you could copy them off to an external HD as each WARC finishes
00:06 ^🔗	MrRadar	Hmm... looking at the source it actually looks like it doesn't by default
00:07 ^🔗	MrRadar	Wait, no, I'm wrong about being wrong
00:07 ^🔗	MrRadar	A few lines down it adds that option
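
MrRadar's suggestion of copying each WARC off to an external drive as it finishes could be sketched roughly as below. This is only a minimal sketch under stated assumptions: the `*.warc.gz` pattern, the directory arguments, and the helper name `move_finished_warcs` are hypothetical, and it assumes the most recently modified WARC is the one still being written, so that file is left alone. It does not reflect how grab-site or wpull actually name or finalize their files.

```python
import shutil
from pathlib import Path


def move_finished_warcs(src_dir, dest_dir, pattern="*.warc.gz"):
    """Move every WARC in src_dir to dest_dir except the newest one.

    Assumption: the most recently modified file is still being written
    by the crawler, so it is skipped. Returns the moved file names, sorted.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Oldest first, so the last element is the (possibly in-progress) newest file.
    warcs = sorted(src.glob(pattern), key=lambda p: p.stat().st_mtime)

    moved = []
    for path in warcs[:-1]:
        shutil.move(str(path), str(dest / path.name))
        moved.append(path.name)
    return sorted(moved)
```

Run periodically (for example from cron), this would keep the small internal disk from filling up, at the cost of guessing which file is finished from modification times alone.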
00:09 ^🔗	Frogging	hook54321: it's a bad idea given the flash memory
00:09 ^🔗	Frogging	(SD card)
00:09 ^🔗	MrRadar	To run it on a Pi? Yeah. I/O performance sucks big time on those, on top of the CPU performance issues
00:09 ^🔗	Frogging	constant downloading, uploading, and erasing
00:11 ^🔗	hook54321	Has anyone tried to run archivebot or grab-site on Kali Linux?
00:12 ^🔗	MrRadar	Looking it up I see it's Debian-based so you shouldn't have any major trouble
00:13 ^🔗	hook54321	The site I'm archiving is my high school's website that they use for teacher websites and homework. But it requires a login which is why I'm doing it myself.
00:14 ^🔗	JesseW	hook54321: that's a really good idea. Thank you for doing that.
00:16 ^🔗	hook54321	It's probably better than my other idea, which is to download everything off of the shared network drive.
00:17 ^🔗	JesseW	Both? Both sounds good.
00:18 ^🔗	DoomTay	I dunno. Copying from the shared drive sounds easier, assuming you can get to it from wherever you are
00:19 ^🔗		BlueMaxim has joined #archiveteam-bs
00:23 ^🔗	hook54321	The website and the drive contain completely different content. There is a computer login that I can use that isn't associated with any student. There is also a web based remote desktop that I could use, but the guest login doesn't work on that.
00:24 ^🔗	DoomTay	Ah
00:25 ^🔗	DoomTay	I just hope the staff is cool with what you're doing, lest you risk trampling the site or breaking certain terms
00:25 ^🔗	hook54321	Would they notice if someone copied everything from the network drive onto an external drive?
00:26 ^🔗	DoomTay	I have no idea
00:26 ^🔗		mismatch has quit IRC (Ping timeout: 501 seconds)
00:27 ^🔗	MrRadar	It probably depends on how much audit logging they're doing and whether they bother to look at them
00:32 ^🔗	JesseW	and how fast you do it
00:33 ^🔗	JesseW	if, say, you copied one file per hour (average, with a random delay), I'm pretty certain they wouldn't notice or care
00:57 ^🔗		metalcamp has quit IRC (Ping timeout: 501 seconds)
01:10 ^🔗	ranma	is donald trump's twitter being backed up actively?
01:11 ^🔗	ranma	not for mentally stimulating reasons
01:14 ^🔗	hook54321	By actively do you mean that whenever he posts something it's automatically backed up? Or like a weekly backup?
01:15 ^🔗		nightpool has quit IRC (Read error: Operation timed out)
01:35 ^🔗		Start_ has joined #archiveteam-bs
01:35 ^🔗		Start has quit IRC (Read error: Connection reset by peer)
01:37 ^🔗		ItsYoda has quit IRC (Ping timeout: 260 seconds)
01:41 ^🔗		ItsYoda has joined #archiveteam-bs
01:52 ^🔗	ranma	automatically
01:52 ^🔗	ranma	whenever
01:53 ^🔗	ranma	i wonder if twitter just "hides" deleted posts
01:53 ^🔗	ranma	a la IA
01:54 ^🔗	MrRadar	It looks like the IA has been capturing it multiple times per day for months now: https://web.archive.org/web/*/https://twitter.com/realdonaldtrump/
01:54 ^🔗	ranma	nice
01:54 ^🔗	*	ranma chuckles
02:07 ^🔗		nightpool has joined #archiveteam-bs
02:21 ^🔗		ndiddy has quit IRC (Leaving)
02:22 ^🔗		ndiddy has joined #archiveteam-bs
02:23 ^🔗	MrRadar	Yay, Maker.tv finally fixed the blip.tv domain to point to a working server so the IA is no longer blocking it due to robots.txt issues
02:25 ^🔗	JesseW	excellent!
02:26 ^🔗	JesseW	are you sure IA didn't just change their mind about what a non-working server means?
02:27 ^🔗	MrRadar	Hmm... maybe they did. If you go to http://blip.tv it redirects to http://maker.tv
02:27 ^🔗	MrRadar	But if you go directly to http://blip.tv/robots.txt you still get a CloudFlare error page
02:28 ^🔗	MrRadar	Either way the IA is no longer blocking access to blip, so I'm happy
02:29 ^🔗	DoomTay	Clooooudflaaaaare
03:12 ^🔗	hook54321	They've been capturing Hillary's feed significantly less than Trump's... Kinda concerns me, but I hate both of them in different ways but essentially equally.
03:12 ^🔗	hook54321	https://web.archive.org/web/*/https://twitter.com/hillaryclinton/
03:15 ^🔗	xmc	"they"
03:20 ^🔗	hook54321	"they"?
03:21 ^🔗	xmc	who are "they"
03:23 ^🔗	hook54321	It could either be the internet archive's algorithm for what gets archived and how often or people manually archiving it through the save now button or both.
03:28 ^🔗	JesseW	(pardon the politics, but) I can't speak to your personal emotions, but I very strongly disagree with the implied idea that Hillary and Trump's ACTIONS, if elected, would be similar. Hillary is vastly less likely than Trump to radically undermine the stability of the country.
03:33 ^🔗	yipdw	JesseW: there's things I want to do on archivebot but I have other, higher priorities
03:34 ^🔗	yipdw	I recommend people use grab-site because (a) it is local, (b) with high occurrence, people who ask about archivebot are trying to backup *chan or some shit
03:34 ^🔗	JesseW	Ah, that makes sense.
03:35 ^🔗	JesseW	I thought the plan was to re-implement the features that are currently only in ArchiveBot in grab-site, then switch over.
03:37 ^🔗	DoomTay	Doesn't grab-site only do a third of the job?
03:37 ^🔗	DoomTay	I mean, you would still have to upload the WARCs and all that
03:37 ^🔗	yipdw	that would be nice
03:38 ^🔗	yipdw	I also haven't really been that motivated to work on it, because it fulfills its original mission fine
03:38 ^🔗	yipdw	I am not really that interested in making it function for gazillion-page sites or real-time doxing
03:39 ^🔗	*	JesseW nods
03:39 ^🔗	yipdw	Javascript-heavy sites are a useful exception and that is something I do want to fix
03:39 ^🔗	JesseW	I'm not sure what you mean by "real-time doxing" (and I'm not sure I want to know)...
03:40 ^🔗	DoomTay	Keeping tabs on twitter feeds?
03:40 ^🔗	yipdw	back up a facebook page every 20 minutes, back up a twitter account continuously
03:40 ^🔗	yipdw	add login capability, add cookie jars
03:40 ^🔗	yipdw	anything to make it pry further
03:42 ^🔗	hook54321	I don't think we need it to do scheduled backups like that, but we should consider trying to do whatever archive.is did to make it so they could archive facebook pages.
03:42 ^🔗	yipdw	see
03:42 ^🔗	yipdw	give a mouse a cookie
03:43 ^🔗	hook54321	http://1.media.collegehumor.cvcdn.com/80/98/f45d988f4740f60eaea9776e76acbce5.jpg
03:43 ^🔗	Frogging	and they'll want a cookie jar
03:43 ^🔗	JesseW	yipdw: ah, ok, now I understand
03:43 ^🔗	JesseW	and I'm pretty sure I agree
03:44 ^🔗	yipdw	privacy and security is generally pretty tenuous, even if you're a nerd, and I don't think we need more Well Actually tools that take advantage of that problem in the architecture
03:44 ^🔗	yipdw	</soapbox>
03:44 ^🔗	Frogging	most nerds seem pro-privacy until it's someone else's :p
03:45 ^🔗	hook54321	What problem?
03:45 ^🔗	xmc	pretty much
03:45 ^🔗	yipdw	yes, you get that a lot in here
03:45 ^🔗	hook54321	Frogging: pretty much. you could also say archivists (archivers?)
03:46 ^🔗	yipdw	hook54321: the problem is that the web is assumed public-by-default and is built to encourage that. I think this is awesome for certain sorts of knowledge (e.g. scientific) but we have gone waaaaay beyond that and even to this date access controls are a mess except to that small population that kinda sorta knows what's going on
03:47 ^🔗	hook54321	what's an example of us going way beyond that?
03:47 ^🔗	yipdw	you don't need much to do an access escalation. for example a cookie jar gives you the ability to poke into friends-only Facebook posts and expose them to a larger, not-friends-only audience
03:48 ^🔗	hook54321	it would be a dummy account, no friends or any sort of connections.
03:48 ^🔗	yipdw	"we" meaning the sorts of things the Web is now used for
03:48 ^🔗	yipdw	I don't trust people to use dummy accounts and I don't trust websites, especially dying websites, to really give a shit
03:49 ^🔗	hook54321	ah. what did you mean by access controls?
03:49 ^🔗	xmc	privacy settings
03:49 ^🔗	yipdw	friend requests
03:49 ^🔗	yipdw	these are access controls
03:49 ^🔗	yipdw	we have on occasion breached them (e.g. Friendster)
03:49 ^🔗	Frogging	hook54321: for projects that need these things, we can use them. but there's no reason to implement it in something as general-purpose as archivebot
03:51 ^🔗	yipdw	anyway this is sort of a long way of saying that yeah there is stuff to do on archivebot but I shelved it, and if someone else would like to dig in, I do have a history of accepting PRs
03:51 ^🔗	hook54321	But isn't it kinda a necessity at this point, because we can't really successfully archive Donald Trump's or Hillary Clinton's Facebook page right now. :P
03:51 ^🔗	yipdw	so get their public websites
03:51 ^🔗	yipdw	good enough
03:51 ^🔗	Frogging	aren't there pages public?
03:51 ^🔗	Frogging	their*
03:51 ^🔗	yipdw	if they're public even better
03:52 ^🔗	Frogging	if they are then I don't get what the issue is
03:52 ^🔗	hook54321	The Facebook pages are public but Facebook hates us and throws captchas at us.
03:52 ^🔗	Frogging	oh
03:53 ^🔗	hook54321	Which archive.is has somehow gotten around and however they did it involves a dummy account.
03:53 ^🔗	yipdw	I'd recommend using archive.is for now, or using grab-site on your own and copying in a Facebook cookie
03:54 ^🔗	yipdw	(at this stage you'd be absolutely right to point out that we have the tools to do everything I said I don't want to encourage. but there's a difference between cookie-jar manipulation and --cookies=)
03:56 ^🔗	hook54321	I'm afraid I don't quite understand the cookie-jar lingo here. I know what cookies are though. (In relation to computers and web browsers)
03:56 ^🔗	hook54321	Another issue is that IA might not appreciate the idea of a Facebook dummy account.
03:56 ^🔗	yipdw	yeah
03:57 ^🔗	yipdw	a cookie jar is a collection of cookies; you can view them in your browser
03:57 ^🔗	yipdw	some tools, like curl, let you set the cookies manually
04:00 ^🔗	yipdw	wpull lets you do this also; it has a --load-cookies=FILE argument that loads cookies from a cookies.txt file using the Netscape/Mozilla format
04:00 ^🔗	yipdw	Firefox doesn't use cookies.txt anymore but there's tools like https://addons.mozilla.org/en-US/firefox/addon/cookie-exporter/ that will generate cookies.txt-format files
04:00 ^🔗	yipdw	so that's what I mean by cookie-jar manipulation: exporting/modifying that cookies file
04:02 ^🔗	DoomTay	You can also use "document.cookies" in the javascript console
04:02 ^🔗	DoomTay	*document.cookie
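
The cookies.txt file yipdw describes can be illustrated with Python's standard library, whose `http.cookiejar.MozillaCookieJar` reads and writes the Netscape/Mozilla format. This is a minimal sketch, not what wpull does internally: the helper names, the `.example.com` domain, and the cookie name and value in the usage note are made-up placeholders.

```python
import http.cookiejar


def make_cookie(domain, name, value):
    """Build a persistent cookie for `domain` (all values are placeholders)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=True, expires=2147483647, discard=False,
        comment=None, comment_url=None, rest={},
    )


def write_cookies_txt(path, cookies):
    """Write the given cookies to `path` in Netscape/Mozilla cookies.txt format."""
    jar = http.cookiejar.MozillaCookieJar(path)
    for cookie in cookies:
        jar.set_cookie(cookie)
    jar.save(ignore_discard=True, ignore_expires=True)


def read_cookies_txt(path):
    """Load a cookies.txt file and return {(domain, name): value}."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.load(ignore_discard=True, ignore_expires=True)
    return {(c.domain, c.name): c.value for c in jar}
```

For example, `write_cookies_txt("cookies.txt", [make_cookie(".example.com", "session", "abc123")])` produces a file of the kind that the `--load-cookies=FILE` argument mentioned above expects; editing that file by hand or by script is the "cookie-jar manipulation" being discussed.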
04:05 ^🔗		ndiddy has quit IRC (Read error: Connection reset by peer)
04:07 ^🔗	hook54321	I'm confused about this: https://voat.co/v/MeanwhileOnReddit/comments/340988
04:08 ^🔗	hook54321	Quote from the end of the post: "TL;DR The guy running archive.is is a delusional conspiracy nut, who blocks access to his site randomly, including entire countries (in this case Finland)."
04:13 ^🔗	JesseW	"We are not archive.is"
04:26 ^🔗	hook54321	:P
04:40 ^🔗		Sk1d has quit IRC (Ping timeout: 250 seconds)
04:44 ^🔗		DoomTay has quit IRC (Quit: Page closed)
04:44 ^🔗		RichardG has joined #archiveteam-bs
04:47 ^🔗		Sk1d has joined #archiveteam-bs
04:54 ^🔗		tomwsmf has quit IRC (Read error: Operation timed out)
05:08 ^🔗		DoomTay has joined #archiveteam-bs
05:12 ^🔗		RichardG has quit IRC (Ping timeout: 633 seconds)
05:33 ^🔗		RichardG has joined #archiveteam-bs
05:54 ^🔗		VADemon has joined #archiveteam-bs
06:07 ^🔗		RichardG has quit IRC (Ping timeout: 633 seconds)
06:29 ^🔗		JesseW has quit IRC (Ping timeout: 370 seconds)
06:33 ^🔗		DoomTay has quit IRC (Quit: Page closed)
07:24 ^🔗		nightpool has quit IRC (Read error: Operation timed out)
08:04 ^🔗		Honno has joined #archiveteam-bs
08:12 ^🔗		VADemon has quit IRC (Quit: left4dead)
08:17 ^🔗		VADemon has joined #archiveteam-bs
08:22 ^🔗		RichardG has joined #archiveteam-bs
08:25 ^🔗		Honno has quit IRC (Ping timeout: 1208 seconds)
08:26 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
08:26 ^🔗		RichardG has joined #archiveteam-bs
08:31 ^🔗		RichardG has quit IRC (Ping timeout: 244 seconds)
08:45 ^🔗		RichardG has joined #archiveteam-bs
08:50 ^🔗		RichardG_ has joined #archiveteam-bs
08:54 ^🔗		RichardG has quit IRC (Ping timeout: 370 seconds)
09:05 ^🔗		RichardG has joined #archiveteam-bs
09:08 ^🔗		RichardG_ has quit IRC (Ping timeout: 370 seconds)
09:13 ^🔗		RichardG has quit IRC (Ping timeout: 370 seconds)
09:17 ^🔗		RichardG has joined #archiveteam-bs
09:25 ^🔗		RichardG_ has joined #archiveteam-bs
09:27 ^🔗		RichardG has quit IRC (Ping timeout: 370 seconds)
09:30 ^🔗		RichardG_ has quit IRC (Ping timeout: 250 seconds)
09:47 ^🔗		RichardG has joined #archiveteam-bs
09:51 ^🔗		RichardG has quit IRC (Ping timeout: 250 seconds)
10:01 ^🔗		RichardG has joined #archiveteam-bs
10:05 ^🔗		RichardG has quit IRC (Ping timeout: 258 seconds)
10:21 ^🔗		Honno has joined #archiveteam-bs
10:28 ^🔗		Sanqui is now known as sanquiAFK
10:45 ^🔗		Honno has quit IRC (Ping timeout: 1208 seconds)
11:01 ^🔗		r3c0d3x has quit IRC (Ping timeout: 260 seconds)
11:07 ^🔗		r3c0d3x has joined #archiveteam-bs
11:12 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
11:12 ^🔗		BartoCH has joined #archiveteam-bs
11:28 ^🔗		GLaDOS has quit IRC (Ping timeout: 260 seconds)
11:31 ^🔗		RichardG has joined #archiveteam-bs
11:41 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
11:59 ^🔗		BartoCH has joined #archiveteam-bs
12:24 ^🔗		Coderjoe has quit IRC (Read error: Operation timed out)
12:27 ^🔗		Coderjoe has joined #archiveteam-bs
12:34 ^🔗		Silvan has joined #archiveteam-bs
12:34 ^🔗		SilSte has quit IRC (Read error: Connection reset by peer)
12:34 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
12:44 ^🔗		metalcamp has joined #archiveteam-bs
13:02 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
13:02 ^🔗		BartoCH has joined #archiveteam-bs
13:29 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
13:36 ^🔗		BartoCH has joined #archiveteam-bs
14:07 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
14:18 ^🔗		BartoCH has joined #archiveteam-bs
14:55 ^🔗		nightpool has joined #archiveteam-bs
14:59 ^🔗		nightpool has quit IRC (Ping timeout: 260 seconds)
15:39 ^🔗		DoomTay has joined #archiveteam-bs
15:54 ^🔗		Coderjoe has quit IRC (Read error: Operation timed out)
15:56 ^🔗		Coderjoe has joined #archiveteam-bs
15:59 ^🔗		Honno has joined #archiveteam-bs
15:59 ^🔗		JesseW has joined #archiveteam-bs
16:02 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
16:17 ^🔗		nightpool has joined #archiveteam-bs
16:19 ^🔗		BartoCH has joined #archiveteam-bs
16:25 ^🔗		DoomTay has quit IRC (Quit: Page closed)
16:29 ^🔗		DoomTay has joined #archiveteam-bs
16:29 ^🔗		Honno has quit IRC (Ping timeout: 1208 seconds)
16:36 ^🔗		DoomTay has quit IRC (Quit: Page closed)
16:46 ^🔗	HCross	it might be a good idea to look at SoundCloud again... site has put itself up for sale. http://www.digitalmusicnews.com/2016/07/27/soundcloud-1-billion-sale-service/
16:52 ^🔗	JesseW	:-(
16:59 ^🔗		ndiddy has joined #archiveteam-bs
17:07 ^🔗		Start_ is now known as Start
17:41 ^🔗		useretail has quit IRC (Remote host closed the connection)
18:01 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
18:08 ^🔗		BartoCH has joined #archiveteam-bs
18:20 ^🔗	yipdw	I gotta get codearchive to whitelist https://github.com/chr15m/drillbit/
18:20 ^🔗	yipdw	this code is doooooope
18:27 ^🔗	arkiver	HCross: yes, but let's first get a yahoo project running
18:39 ^🔗		nightpool has quit IRC (Read error: Operation timed out)
18:53 ^🔗		useretail has joined #archiveteam-bs
19:02 ^🔗		nightpool has joined #archiveteam-bs
19:17 ^🔗		schbirid has joined #archiveteam-bs
19:30 ^🔗		Coderjoe has quit IRC (Read error: Operation timed out)
19:39 ^🔗		JesseW has quit IRC (Ping timeout: 370 seconds)
19:41 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
19:42 ^🔗		tomwsmf has joined #archiveteam-bs
19:48 ^🔗		Coderjoe has joined #archiveteam-bs
19:49 ^🔗		nightpool has quit IRC (Read error: Operation timed out)
19:56 ^🔗		BartoCH has joined #archiveteam-bs
20:11 ^🔗		anjacks0n has joined #archiveteam-bs
20:15 ^🔗		anjacks0n has quit IRC (Ping timeout: 190 seconds)
20:32 ^🔗		nightpool has joined #archiveteam-bs
20:34 ^🔗	dashcloud	yipdw: it needs 10 stars before it would be considered for archiving
20:34 ^🔗	yipdw	yes, I know
20:34 ^🔗	yipdw	they also have a whitelist
20:48 ^🔗		metalcamp has quit IRC (Ping timeout: 244 seconds)
20:55 ^🔗		robink has quit IRC (Ping timeout: 260 seconds)
21:20 ^🔗		Asparagir has joined #archiveteam-bs
21:28 ^🔗		robink has joined #archiveteam-bs
21:39 ^🔗		VADemon has quit IRC (Quit: left4dead)
21:40 ^🔗		JesseW has joined #archiveteam-bs
21:40 ^🔗		schbirid has quit IRC (Quit: Leaving)
22:25 ^🔗	JesseW	http://www.roaming-initiative.com/blog/posts/wtfm -- this is an awesome way to deal with questions, and I only just now heard about it
22:26 ^🔗		Asparag-1 has joined #archiveteam-bs
22:27 ^🔗		Asparag-1 has left
22:36 ^🔗		Coderjoe has quit IRC (Read error: Operation timed out)
22:54 ^🔗		Coderjoe has joined #archiveteam-bs
23:08 ^🔗		BlueMaxim has joined #archiveteam-bs
23:08 ^🔗		REiN^ has quit IRC (Ping timeout: 260 seconds)