#archiveteam-bs 2016-07-31,Sun

Time Nickname Message
00:01 🔗 hook54321 Would grab-site run on a Raspberry Pi? :P
00:01 🔗 MrRadar Again, I don't see why not. Though it might not be super-fast
00:02 🔗 MrRadar I switched to running wpull on an old Core 2 Duo laptop after about a week since every time I started it it would take 15+ seconds to actually start doing work
00:02 🔗 MrRadar On the Pi
00:03 🔗 MrRadar And it would noticeably pause to parse large web pages
00:03 🔗 MrRadar Or even not so large pages
00:03 🔗 MrRadar Even an old Atom like that is probably multiple times faster than the 1st Raspberry Pi CPU
00:03 🔗 MrRadar Which is approximately equivalent to a Pentium 2
00:04 🔗 hook54321 The computer with the Atom processor only has a 160 GB hard drive, I'm pretty sure that will become an issue.
00:04 🔗 MrRadar It depends on how big the site is that you're trying to scrape
00:05 🔗 MrRadar I think grab-site instructs wpull to split the WARC files at a certain size
00:05 🔗 whydomain has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
00:05 🔗 MrRadar So you could copy them off to an external HD as each WARC finishes
00:06 🔗 MrRadar Hmm... looking at the source it actually looks like it doesn't by default
00:07 🔗 MrRadar Wait, no, I'm wrong about being wrong
00:07 🔗 MrRadar A few lines down it adds that option
00:09 🔗 Frogging hook54321: it's a bad idea given the flash memory
00:09 🔗 Frogging (SD card)
00:09 🔗 MrRadar To run it on a Pi? Yeah. I/O performance sucks big time on those, on top of the CPU performance issues
00:09 🔗 Frogging constant downloading, uploading, and erasing
00:11 🔗 hook54321 Has anyone tried to run archivebot or grab-site on Kali Linux?
00:12 🔗 MrRadar Looking it up I see it's Debian-based so you shouldn't have any major trouble
00:13 🔗 hook54321 The site I'm archiving is my high school's website that they use for teacher websites and homework. But it requires a login which is why I'm doing it myself.
00:14 🔗 JesseW hook54321: that's a really good idea. Thank you for doing that.
00:16 🔗 hook54321 It's probably better than my other idea, which is to download everything off of the shared network drive.
00:17 🔗 JesseW Both? Both sounds good.
00:18 🔗 DoomTay I dunno. Copying from the shared drive sounds easier, assuming you can get to it from wherever you are
00:19 🔗 BlueMaxim has joined #archiveteam-bs
00:23 🔗 hook54321 The website and the drive contain completely different content. There is a computer login that I can use that isn't associated with any student. There is also a web based remote desktop that I could use, but the guest login doesn't work on that.
00:24 🔗 DoomTay Ah
00:25 🔗 DoomTay I just hope the staff is cool with what you're doing, lest you risk trampling the site or breaking certain terms
00:25 🔗 hook54321 Would they notice if someone copied everything from the network drive onto an external drive?
00:26 🔗 DoomTay I have no idea
00:26 🔗 mismatch has quit IRC (Ping timeout: 501 seconds)
00:27 🔗 MrRadar It probably depends on how much audit logging they're doing and whether they bother to look at them
00:32 🔗 JesseW and how fast you do it
00:33 🔗 JesseW if, say, you copied one file per hour (average, with a random delay), I'm pretty certain they wouldn't notice or care
00:57 🔗 metalcamp has quit IRC (Ping timeout: 501 seconds)
01:10 🔗 ranma is donald trump's twitter being backed up actively?
01:11 🔗 ranma not for mentally stimulating reasons
01:14 🔗 hook54321 By actively do you mean that whenever he posts something it's automatically backed up? Or like a weekly backup?
01:15 🔗 nightpool has quit IRC (Read error: Operation timed out)
01:35 🔗 Start_ has joined #archiveteam-bs
01:35 🔗 Start has quit IRC (Read error: Connection reset by peer)
01:37 🔗 ItsYoda has quit IRC (Ping timeout: 260 seconds)
01:41 🔗 ItsYoda has joined #archiveteam-bs
01:52 🔗 ranma automatically
01:52 🔗 ranma whenever
01:53 🔗 ranma i wonder if twitter just "hides" deleted posts
01:53 🔗 ranma a la IA
01:54 🔗 MrRadar It looks like the IA has been capturing it multiple times per day for months now: https://web.archive.org/web/*/https://twitter.com/realdonaldtrump/
01:54 🔗 ranma nice
01:54 🔗 * ranma chuckles
02:07 🔗 nightpool has joined #archiveteam-bs
02:21 🔗 ndiddy has quit IRC (Leaving)
02:22 🔗 ndiddy has joined #archiveteam-bs
02:23 🔗 MrRadar Yay, Maker.tv finally fixed the blip.tv domain to point to a working server so the IA is no longer blocking it due to robots.txt issues
02:25 🔗 JesseW excellent!
02:26 🔗 JesseW are you sure IA didn't just change their mind about what a non-working server means?
02:27 🔗 MrRadar Hmm... maybe they did. If you go to http://blip.tv it redirects to http://maker.tv
02:27 🔗 MrRadar But if you go directly to http://blip.tv/robots.txt you still get a CloudFlare error page
02:28 🔗 MrRadar Either way the IA is no longer blocking access to blip, so I'm happy
02:29 🔗 DoomTay Clooooudflaaaaare
03:12 🔗 hook54321 They've been capturing Hillary's feed significantly less than Trump's... Kinda concerns me, but I hate both of them in different ways but essentially equally.
03:12 🔗 hook54321 https://web.archive.org/web/*/https://twitter.com/hillaryclinton/
03:15 🔗 xmc "they"
03:20 🔗 hook54321 "they"?
03:21 🔗 xmc who are "they"
03:23 🔗 hook54321 It could either be the internet archive's algorithm for what gets archived and how often or people manually archiving it through the save now button or both.
03:28 🔗 JesseW (pardon the politics, but) I can't speak to your personal emotions, but I very strongly disagree with the implied idea that Hillary and Trump's *ACTIONS*, if elected, would be similar. Hillary is vastly less likely than Trump to radically undermine the stability of the country.
03:33 🔗 yipdw JesseW: there's things I want to do on archivebot but I have other, higher priorities
03:34 🔗 yipdw I recommend people use grab-site because (a) it is local, (b) with high occurrence, people who ask about archivebot are trying to back up *chan or some shit
03:34 🔗 JesseW Ah, that makes sense.
03:35 🔗 JesseW I thought the plan was to re-implement the features that are currently only in ArchiveBot in grab-site, then switch over.
03:37 🔗 DoomTay Doesn't grab-site only do a third of the job?
03:37 🔗 DoomTay I mean, you would still have to upload the WARCs and all that
03:37 🔗 yipdw that would be nice
03:38 🔗 yipdw I also haven't really been that motivated to work on it, because it fulfills its original mission fine
03:38 🔗 yipdw I am not really that interested in making it function for gazillion-page sites or real-time doxing
03:39 🔗 * JesseW nods
03:39 🔗 yipdw Javascript-heavy sites are a useful exception and that is something I do want to fix
03:39 🔗 JesseW I'm not sure what you mean by "real-time doxing" (and I'm not sure I want to know)...
03:40 🔗 DoomTay Keeping tabs on twitter feeds?
03:40 🔗 yipdw back up a facebook page every 20 minutes, back up a twitter account continuously
03:40 🔗 yipdw add login capability, add cookie jars
03:40 🔗 yipdw anything to make it pry further
03:42 🔗 hook54321 I don't think we need it to do scheduled backups like that, but we should consider trying to do whatever archive.is did to make it so they could archive facebook pages.
03:42 🔗 yipdw see
03:42 🔗 yipdw give a mouse a cookie
03:43 🔗 hook54321 http://1.media.collegehumor.cvcdn.com/80/98/f45d988f4740f60eaea9776e76acbce5.jpg
03:43 🔗 Frogging and they'll want a cookie jar
03:43 🔗 JesseW yipdw: ah, ok, *now* I understand
03:43 🔗 JesseW and I'm pretty sure I agree
03:44 🔗 yipdw privacy and security is generally pretty tenuous, even if you're a nerd, and I don't think we need more Well Actually tools that take advantage of that problem in the architecture
03:44 🔗 yipdw </soapbox>
03:44 🔗 Frogging most nerds seem pro-privacy until it's someone else's :p
03:45 🔗 hook54321 What problem?
03:45 🔗 xmc pretty much
03:45 🔗 yipdw yes, you get that a lot in here
03:45 🔗 hook54321 Frogging: pretty much. you could also say archivists (archivers?)
03:46 🔗 yipdw hook54321: the problem is that the web is assumed public-by-default and is built to encourage that. I think this is awesome for certain sorts of knowledge (e.g. scientific) but we have gone waaaaay beyond that and even to this date access controls are a mess except to that small population that kinda sorta knows what's going on
03:47 🔗 hook54321 what's an example of us going way beyond that?
03:47 🔗 yipdw you don't need much to do an access escalation. for example a cookie jar gives you the ability to poke into friends-only Facebook posts and expose them to a larger, not-friends-only audience
03:48 🔗 hook54321 it would be a dummy account, no friends or any sort of connections.
03:48 🔗 yipdw "we" meaning the sorts of things the Web is now used for
03:48 🔗 yipdw I don't trust people to use dummy accounts and I don't trust websites, especially dying websites, to really give a shit
03:49 🔗 hook54321 ah. what did you mean by access controls?
03:49 🔗 xmc privacy settings
03:49 🔗 yipdw friend requests
03:49 🔗 yipdw these are access controls
03:49 🔗 yipdw we have on occasion breached them (e.g. Friendster)
03:49 🔗 Frogging hook54321: for projects that need these things, we can use them. but there's no reason to implement it in something as general-purpose as archivebot
03:51 🔗 yipdw anyway this is sort of a long way of saying that yeah there is stuff to do on archivebot but I shelved it, and if someone else would like to dig in, I do have a history of accepting PRs
03:51 🔗 hook54321 But isn't it kinda a necessity at this point, because we can't really successfully archive Donald Trump's or Hillary Clinton's Facebook page right now. :P
03:51 🔗 yipdw so get their public websites
03:51 🔗 yipdw good enough
03:51 🔗 Frogging aren't there pages public?
03:51 🔗 Frogging their*
03:51 🔗 yipdw if they're public even better
03:52 🔗 Frogging if they are then I don't get what the issue is
03:52 🔗 hook54321 The Facebook pages are public but Facebook hates us and throws captchas at us.
03:52 🔗 Frogging oh
03:53 🔗 hook54321 Which archive.is has somehow gotten around and however they did it involves a dummy account.
03:53 🔗 yipdw I'd recommend using archive.is for now, or using grab-site on your own and copying in a Facebook cookie
03:54 🔗 yipdw (at this stage you'd be absolutely right to point out that we have the tools to do everything I said I don't want to encourage. but there's a difference between cookie-jar manipulation and --cookies=)
03:56 🔗 hook54321 I'm afraid I don't quite understand the cookie-jar lingo here. I know what cookies are though. (In relation to computers and web browsers)
03:56 🔗 hook54321 Another issue is that IA might not appreciate the idea of a Facebook dummy account.
03:56 🔗 yipdw yeah
03:57 🔗 yipdw a cookie jar is a collection of cookies; you can view them in your browser
03:57 🔗 yipdw some tools, like curl, let you set the cookies manually
04:00 🔗 yipdw wpull lets you do this also; it has a --load-cookies=FILE argument that loads cookies from a cookies.txt file using the Netscape/Mozilla format
04:00 🔗 yipdw Firefox doesn't use cookies.txt anymore but there's tools like https://addons.mozilla.org/en-US/firefox/addon/cookie-exporter/ that will generate cookies.txt-format files
04:00 🔗 yipdw so that's what I mean by cookie-jar manipulation: exporting/modifying that cookies file
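For readers unfamiliar with the cookies.txt format yipdw describes, here is a minimal sketch of writing and reading one with Python's standard `http.cookiejar.MozillaCookieJar`. The helper names (`make_cookie`, `write_cookies_txt`, `read_cookies_txt`), the domain, and the cookie value are all hypothetical placeholders, and this is not how wpull or grab-site handles cookies internally. It only shows what the Netscape/Mozilla file looks like from a script's point of view.

```python
import http.cookiejar
import os
import tempfile
import time


def make_cookie(name, value, domain, path="/", secure=False, days=30):
    """Build a Cookie object. Every value passed in here is a placeholder."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path=path, path_specified=True,
        secure=secure,
        expires=int(time.time()) + days * 86400,
        discard=False, comment=None, comment_url=None, rest={},
    )


def write_cookies_txt(path, cookies):
    """Save cookies to `path` in Netscape/Mozilla cookies.txt format."""
    jar = http.cookiejar.MozillaCookieJar(path)
    for cookie in cookies:
        jar.set_cookie(cookie)
    jar.save(ignore_discard=True, ignore_expires=True)


def read_cookies_txt(path):
    """Load a cookies.txt file and return {(domain, name): value}."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.load(ignore_discard=True, ignore_expires=True)
    return {(c.domain, c.name): c.value for c in jar}


if __name__ == "__main__":
    # Hypothetical example: one session cookie for a made-up domain.
    out = os.path.join(tempfile.mkdtemp(), "cookies.txt")
    write_cookies_txt(out, [make_cookie("session", "abc123", ".example.com")])
    print(read_cookies_txt(out))
    with open(out) as f:
        print(f.read())
```

A file produced this way is the kind of input the `--load-cookies=FILE` argument mentioned above expects; whether a given site accepts the cookies depends entirely on the site.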
04:02 🔗 DoomTay You can also use "document.cookies" in the javascript console
04:02 🔗 DoomTay *document.cookie
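As a rough illustration of DoomTay's suggestion: `document.cookie` in the browser console returns a single string of `name=value` pairs separated by semicolons. The sketch below splits such a string into a dictionary. The function name `parse_document_cookie` and the sample string are made up for illustration. Note that `document.cookie` omits HttpOnly cookies and carries no domain, path, or expiry information, so it is not a complete substitute for an exported cookies.txt file.

```python
def parse_document_cookie(cookie_string):
    """Split a document.cookie-style string into a {name: value} dict."""
    pairs = {}
    for part in cookie_string.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first "=" only, since values may contain "=" too.
        name, _, value = part.partition("=")
        pairs[name] = value
    return pairs


if __name__ == "__main__":
    # Hypothetical string as it might be copied from the console.
    print(parse_document_cookie("lang=en; session=abc123; token=a=b"))
```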
04:05 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
04:07 🔗 hook54321 I'm confused about this: https://voat.co/v/MeanwhileOnReddit/comments/340988
04:08 🔗 hook54321 Quote from the end of the post: "TL;DR The guy running archive.is is a delusional conspiracy nut, who blocks access to his site randomly, including entire countries (in this case Finland)."
04:13 🔗 JesseW "We are not archive.is"
04:26 🔗 hook54321 :P
04:40 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:44 🔗 DoomTay has quit IRC (Quit: Page closed)
04:44 🔗 RichardG has joined #archiveteam-bs
04:47 🔗 Sk1d has joined #archiveteam-bs
04:54 🔗 tomwsmf has quit IRC (Read error: Operation timed out)
05:08 🔗 DoomTay has joined #archiveteam-bs
05:12 🔗 RichardG has quit IRC (Ping timeout: 633 seconds)
05:33 🔗 RichardG has joined #archiveteam-bs
05:54 🔗 VADemon has joined #archiveteam-bs
06:07 🔗 RichardG has quit IRC (Ping timeout: 633 seconds)
06:29 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
06:33 🔗 DoomTay has quit IRC (Quit: Page closed)
07:24 🔗 nightpool has quit IRC (Read error: Operation timed out)
08:04 🔗 Honno has joined #archiveteam-bs
08:12 🔗 VADemon has quit IRC (Quit: left4dead)
08:17 🔗 VADemon has joined #archiveteam-bs
08:22 🔗 RichardG has joined #archiveteam-bs
08:25 🔗 Honno has quit IRC (Ping timeout: 1208 seconds)
08:26 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
08:26 🔗 RichardG has joined #archiveteam-bs
08:31 🔗 RichardG has quit IRC (Ping timeout: 244 seconds)
08:45 🔗 RichardG has joined #archiveteam-bs
08:50 🔗 RichardG_ has joined #archiveteam-bs
08:54 🔗 RichardG has quit IRC (Ping timeout: 370 seconds)
09:05 🔗 RichardG has joined #archiveteam-bs
09:08 🔗 RichardG_ has quit IRC (Ping timeout: 370 seconds)
09:13 🔗 RichardG has quit IRC (Ping timeout: 370 seconds)
09:17 🔗 RichardG has joined #archiveteam-bs
09:25 🔗 RichardG_ has joined #archiveteam-bs
09:27 🔗 RichardG has quit IRC (Ping timeout: 370 seconds)
09:30 🔗 RichardG_ has quit IRC (Ping timeout: 250 seconds)
09:47 🔗 RichardG has joined #archiveteam-bs
09:51 🔗 RichardG has quit IRC (Ping timeout: 250 seconds)
10:01 🔗 RichardG has joined #archiveteam-bs
10:05 🔗 RichardG has quit IRC (Ping timeout: 258 seconds)
10:21 🔗 Honno has joined #archiveteam-bs
10:28 🔗 Sanqui is now known as sanquiAFK
10:45 🔗 Honno has quit IRC (Ping timeout: 1208 seconds)
11:01 🔗 r3c0d3x has quit IRC (Ping timeout: 260 seconds)
11:07 🔗 r3c0d3x has joined #archiveteam-bs
11:12 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
11:12 🔗 BartoCH has joined #archiveteam-bs
11:28 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
11:31 🔗 RichardG has joined #archiveteam-bs
11:41 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
11:59 🔗 BartoCH has joined #archiveteam-bs
12:24 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
12:27 🔗 Coderjoe has joined #archiveteam-bs
12:34 🔗 Silvan has joined #archiveteam-bs
12:34 🔗 SilSte has quit IRC (Read error: Connection reset by peer)
12:34 🔗 BlueMaxim has quit IRC (Quit: Leaving)
12:44 🔗 metalcamp has joined #archiveteam-bs
13:02 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:02 🔗 BartoCH has joined #archiveteam-bs
13:29 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:36 🔗 BartoCH has joined #archiveteam-bs
14:07 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
14:18 🔗 BartoCH has joined #archiveteam-bs
14:55 🔗 nightpool has joined #archiveteam-bs
14:59 🔗 nightpool has quit IRC (Ping timeout: 260 seconds)
15:39 🔗 DoomTay has joined #archiveteam-bs
15:54 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
15:56 🔗 Coderjoe has joined #archiveteam-bs
15:59 🔗 Honno has joined #archiveteam-bs
15:59 🔗 JesseW has joined #archiveteam-bs
16:02 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
16:17 🔗 nightpool has joined #archiveteam-bs
16:19 🔗 BartoCH has joined #archiveteam-bs
16:25 🔗 DoomTay has quit IRC (Quit: Page closed)
16:29 🔗 DoomTay has joined #archiveteam-bs
16:29 🔗 Honno has quit IRC (Ping timeout: 1208 seconds)
16:36 🔗 DoomTay has quit IRC (Quit: Page closed)
16:46 🔗 HCross it might be a good idea to look at SoundCloud again... site has put itself up for sale. http://www.digitalmusicnews.com/2016/07/27/soundcloud-1-billion-sale-service/
16:52 🔗 JesseW :-(
16:59 🔗 ndiddy has joined #archiveteam-bs
17:07 🔗 Start_ is now known as Start
17:41 🔗 useretail has quit IRC (Remote host closed the connection)
18:01 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
18:08 🔗 BartoCH has joined #archiveteam-bs
18:20 🔗 yipdw I gotta get codearchive to whitelist https://github.com/chr15m/drillbit/
18:20 🔗 yipdw this code is doooooope
18:27 🔗 arkiver HCross: yes, but let's first get a yahoo project running
18:39 🔗 nightpool has quit IRC (Read error: Operation timed out)
18:53 🔗 useretail has joined #archiveteam-bs
19:02 🔗 nightpool has joined #archiveteam-bs
19:17 🔗 schbirid has joined #archiveteam-bs
19:30 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
19:39 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
19:41 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
19:42 🔗 tomwsmf has joined #archiveteam-bs
19:48 🔗 Coderjoe has joined #archiveteam-bs
19:49 🔗 nightpool has quit IRC (Read error: Operation timed out)
19:56 🔗 BartoCH has joined #archiveteam-bs
20:11 🔗 anjacks0n has joined #archiveteam-bs
20:15 🔗 anjacks0n has quit IRC (Ping timeout: 190 seconds)
20:32 🔗 nightpool has joined #archiveteam-bs
20:34 🔗 dashcloud yipdw: it needs 10 stars before it would be considered for archiving
20:34 🔗 yipdw yes, I know
20:34 🔗 yipdw they also have a whitelist
20:48 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
20:55 🔗 robink has quit IRC (Ping timeout: 260 seconds)
21:20 🔗 Asparagir has joined #archiveteam-bs
21:28 🔗 robink has joined #archiveteam-bs
21:39 🔗 VADemon has quit IRC (Quit: left4dead)
21:40 🔗 JesseW has joined #archiveteam-bs
21:40 🔗 schbirid has quit IRC (Quit: Leaving)
22:25 🔗 JesseW http://www.roaming-initiative.com/blog/posts/wtfm -- this is an awesome way to deal with questions, and only just now heard about it
22:26 🔗 Asparag-1 has joined #archiveteam-bs
22:27 🔗 Asparag-1 has left
22:36 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
22:54 🔗 Coderjoe has joined #archiveteam-bs
23:08 🔗 BlueMaxim has joined #archiveteam-bs
23:08 🔗 REiN^ has quit IRC (Ping timeout: 260 seconds)
