[00:01] Would grab-site run on a Raspberry Pi? :P
[00:01] Again, I don't see why not. Though it might not be super-fast
[00:02] I switched to running wpull on an old Core 2 Duo laptop after about a week since every time I started it, it would take 15+ seconds to actually start doing work
[00:02] On the Pi
[00:03] And it would noticeably pause to parse large web pages
[00:03] Or even not so large pages
[00:03] Even an old Atom like that is probably multiple times faster than the 1st Raspberry Pi CPU
[00:03] Which is approximately equivalent to a Pentium 2
[00:04] The computer with the Atom processor only has a 160 GB hard drive, I'm pretty sure that will become an issue.
[00:04] It depends on how big the site is that you're trying to scrape
[00:05] I think grab-site instructs wpull to split the WARC files at a certain size
[00:05] *** whydomain has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[00:05] So you could copy them off to an external HD as each WARC finishes
[00:06] Hmm... looking at the source it actually looks like it doesn't by default
[00:07] Wait, no, I'm wrong about being wrong
[00:07] A few lines down it adds that option
[00:09] hook54321: it's a bad idea given the flash memory
[00:09] (SD card)
[00:09] To run it on a Pi? Yeah. I/O performance sucks big time on those, on top of the CPU performance issues
[00:09] constant downloading, uploading, and erasing
[00:11] Has anyone tried to run archivebot or grab-site on Kali Linux?
[00:12] Looking it up I see it's Debian-based so you shouldn't have any major trouble
[00:13] The site I'm archiving is my high school's website that they use for teacher websites and homework. But it requires a login which is why I'm doing it myself.
[00:14] hook54321: that's a really good idea. Thank you for doing that.
[00:16] It's probably better than my other idea, which is to download everything off of the shared network drive.
[00:17] Both? Both sounds good.
[00:18] I dunno.
Copying from the shared drive sounds easier, assuming you can get to it from wherever you are
[00:19] *** BlueMaxim has joined #archiveteam-bs
[00:23] The website and the drive contain completely different content. There is a computer login that I can use that isn't associated with any student. There is also a web based remote desktop that I could use, but the guest login doesn't work on that.
[00:24] Ah
[00:25] I just hope the staff is cool with what you're doing, lest you risk trampling the site or breaking certain terms
[00:25] Would they notice if someone copied everything from the network drive onto an external drive?
[00:26] I have no idea
[00:26] *** mismatch has quit IRC (Ping timeout: 501 seconds)
[00:27] It probably depends on how much audit logging they're doing and whether they bother to look at them
[00:32] and how fast you do it
[00:33] if, say, you copied one file per hour (average, with a random delay), I'm pretty certain they wouldn't notice or care
[00:57] *** metalcamp has quit IRC (Ping timeout: 501 seconds)
[01:10] is donald trump's twitter being backed up actively?
[01:11] not for mentally stimulating reasons
[01:14] By actively do you mean that whenever he posts something it's automatically backed up? Or like a weekly backup?
[01:15] *** nightpool has quit IRC (Read error: Operation timed out)
[01:35] *** Start_ has joined #archiveteam-bs
[01:35] *** Start has quit IRC (Read error: Connection reset by peer)
[01:37] *** ItsYoda has quit IRC (Ping timeout: 260 seconds)
[01:41] *** ItsYoda has joined #archiveteam-bs
[01:52] automatically
[01:52] whenever
[01:53] i wonder if twitter just "hides" deleted posts
[01:53] a la IA
[01:54] It looks like the IA has been capturing it multiple times per day for months now: https://web.archive.org/web/*/https://twitter.com/realdonaldtrump/
[01:54] nice
[01:54] * ranma chuckles
[02:07] *** nightpool has joined #archiveteam-bs
[02:21] *** ndiddy has quit IRC (Leaving)
[02:22] *** ndiddy has joined #archiveteam-bs
[02:23] Yay, Maker.tv finally fixed the blip.tv domain to point to a working server so the IA is no longer blocking it due to robots.txt issues
[02:25] excellent!
[02:26] are you sure IA didn't just change their mind about what a non-working server means?
[02:27] Hmm... maybe they did. If you go to http://blip.tv it redirects to http://maker.tv
[02:27] But if you go directly to http://blip.tv/robots.txt you still get a CloudFlare error page
[02:28] Either way the IA is no longer blocking access to blip, so I'm happy
[02:29] Clooooudflaaaaare
[03:12] They've been capturing Hillary's feed significantly less than Trump's... Kinda concerns me, but I hate both of them in different ways but essentially equally.
[03:12] https://web.archive.org/web/*/https://twitter.com/hillaryclinton/
[03:15] "they"
[03:20] "they"?
[03:21] who are "they"
[03:23] It could either be the internet archive's algorithm for what gets archived and how often or people manually archiving it through the save now button or both.
[03:28] (pardon the politics, but) I can't speak to your personal emotions, but I very strongly disagree with the implied idea that Hillary and Trump's *ACTIONS*, if elected, would be similar.
Hillary is vastly less likely than Trump to radically undermine the stability of the country.
[03:33] JesseW: there's things I want to do on archivebot but I have other, higher priorities
[03:34] I recommend people use grab-site because (a) it is local, (b) with high occurrence, people who ask about archivebot are trying to back up *chan or some shit
[03:34] Ah, that makes sense.
[03:35] I thought the plan was to re-implement the features that are currently only in ArchiveBot in grab-site, then switch over.
[03:37] Doesn't grab-site only do a third of the job?
[03:37] I mean, you would still have to upload the WARCs and all that
[03:37] that would be nice
[03:38] I also haven't really been that motivated to work on it, because it fulfills its original mission fine
[03:38] I am not really that interested in making it function for gazillion-page sites or real-time doxing
[03:39] * JesseW nods
[03:39] JavaScript-heavy sites are a useful exception and that is something I do want to fix
[03:39] I'm not sure what you mean by "real-time doxing" (and I'm not sure I want to know)...
[03:40] Keeping tabs on twitter feeds?
[03:40] back up a facebook page every 20 minutes, back up a twitter account continuously
[03:40] add login capability, add cookie jars
[03:40] anything to make it pry further
[03:42] I don't think we need it to do scheduled backups like that, but we should consider trying to do whatever archive.is did to make it so they could archive facebook pages.
[03:42] see
[03:42] give a mouse a cookie
[03:43] http://1.media.collegehumor.cvcdn.com/80/98/f45d988f4740f60eaea9776e76acbce5.jpg
[03:43] and they'll want a cookie jar
[03:43] yipdw: ah, ok, *now* I understand
[03:43] and I'm pretty sure I agree
[03:44] privacy and security is generally pretty tenuous, even if you're a nerd, and I don't think we need more Well Actually tools that take advantage of that problem in the architecture
[03:44]
[03:44] most nerds seem pro-privacy until it's someone else's :p
[03:45] What problem?
[03:45] pretty much
[03:45] yes, you get that a lot in here
[03:45] Frogging: pretty much. you could also say archivists (archivers?)
[03:46] hook54321: the problem is that the web is assumed public-by-default and is built to encourage that. I think this is awesome for certain sorts of knowledge (e.g. scientific) but we have gone waaaaay beyond that and even to this date access controls are a mess except to that small population that kinda sorta knows what's going on
[03:47] what's an example of us going way beyond that?
[03:47] you don't need much to do an access escalation. for example a cookie jar gives you the ability to poke into friends-only Facebook posts and expose them to a larger, not-friends-only audience
[03:48] it would be a dummy account, no friends or any sort of connections.
[03:48] "we" meaning the sorts of things the Web is now used for
[03:48] I don't trust people to use dummy accounts and I don't trust websites, especially dying websites, to really give a shit
[03:49] ah. what did you mean by access controls?
[03:49] privacy settings
[03:49] friend requests
[03:49] these are access controls
[03:49] we have on occasion breached them (e.g. Friendster)
[03:49] hook54321: for projects that need these things, we can use them.
but there's no reason to implement it in something as general-purpose as archivebot
[03:51] anyway this is sort of a long way of saying that yeah there is stuff to do on archivebot but I shelved it, and if someone else would like to dig in, I do have a history of accepting PRs
[03:51] But isn't it kinda a necessity at this point, because we can't really successfully archive Donald Trump's or Hillary Clinton's Facebook page right now. :P
[03:51] so get their public websites
[03:51] good enough
[03:51] aren't there pages public?
[03:51] their*
[03:51] if they're public even better
[03:52] if they are then I don't get what the issue is
[03:52] The Facebook pages are public but Facebook hates us and throws captchas at us.
[03:52] oh
[03:53] Which archive.is has somehow gotten around and however they did it involves a dummy account.
[03:53] I'd recommend using archive.is for now, or using grab-site on your own and copying in a Facebook cookie
[03:54] (at this stage you'd be absolutely right to point out that we have the tools to do everything I said I don't want to encourage. but there's a difference between cookie-jar manipulation and --cookies=)
[03:56] I'm afraid I don't quite understand the cookie-jar lingo here. I know what cookies are though. (In relation to computers and web browsers)
[03:56] Another issue is that IA might not appreciate the idea of a Facebook dummy account.
[03:56] yeah
[03:57] a cookie jar is a collection of cookies; you can view them in your browser
[03:57] some tools, like curl, let you set the cookies manually
[04:00] wpull lets you do this also; it has a --load-cookies=FILE argument that loads cookies from a cookies.txt file using the Netscape/Mozilla format
[04:00] Firefox doesn't use cookies.txt anymore but there's tools like https://addons.mozilla.org/en-US/firefox/addon/cookie-exporter/ that will generate cookies.txt-format files
[04:00] so that's what I mean by cookie-jar manipulation: exporting/modifying that cookies file
[04:02] You can also use "document.cookies" in the JavaScript console
[04:02] *document.cookie
[04:05] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[04:07] I'm confused about this: https://voat.co/v/MeanwhileOnReddit/comments/340988
[04:08] Quote from the end of the post: "TL;DR The guy running archive.is is a delusional conspiracy nut, who blocks access to his site randomly, including entire countries (in this case Finland)."
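(Editor's note on the cookies.txt discussion around [04:00]: the following is a minimal sketch, not anything posted in the channel. It assumes Python's standard http.cookiejar module, whose MozillaCookieJar class reads and writes the Netscape/Mozilla cookies.txt format. The helper names, the example.com domain, and the cookie names and values are all hypothetical placeholders.)

```python
import http.cookiejar
import time


def make_cookie(domain, name, value, days=30):
    """Build one cookie for `domain` that expires `days` from now (hypothetical helper)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=True,
        expires=int(time.time()) + days * 86400,
        discard=False, comment=None, comment_url=None, rest={},
    )


def write_cookies_txt(path, domain, pairs):
    """Save name/value pairs as a Netscape-format cookies.txt file."""
    jar = http.cookiejar.MozillaCookieJar(path)
    for name, value in pairs.items():
        jar.set_cookie(make_cookie(domain, name, value))
    jar.save(ignore_discard=True)
    return jar


def read_cookies_txt(path):
    """Load a cookies.txt file and return its cookies as a name -> value dict."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.load(ignore_discard=True)
    return {cookie.name: cookie.value for cookie in jar}


if __name__ == "__main__":
    # Placeholder values only; a real export would come from a browser add-on.
    write_cookies_txt("cookies.txt", ".example.com", {"session": "abc123"})
    print(read_cookies_txt("cookies.txt"))
```

(A file written this way is the kind of cookies.txt that the --load-cookies=FILE argument mentioned above is described as accepting; editing the file by hand or with a script like this is the "cookie-jar manipulation" being discussed.)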
[04:13] "We are not archive.is"
[04:26] :P
[04:40] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:44] *** DoomTay has quit IRC (Quit: Page closed)
[04:44] *** RichardG has joined #archiveteam-bs
[04:47] *** Sk1d has joined #archiveteam-bs
[04:54] *** tomwsmf has quit IRC (Read error: Operation timed out)
[05:08] *** DoomTay has joined #archiveteam-bs
[05:12] *** RichardG has quit IRC (Ping timeout: 633 seconds)
[05:33] *** RichardG has joined #archiveteam-bs
[05:54] *** VADemon has joined #archiveteam-bs
[06:07] *** RichardG has quit IRC (Ping timeout: 633 seconds)
[06:29] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:33] *** DoomTay has quit IRC (Quit: Page closed)
[07:24] *** nightpool has quit IRC (Read error: Operation timed out)
[08:04] *** Honno has joined #archiveteam-bs
[08:12] *** VADemon has quit IRC (Quit: left4dead)
[08:17] *** VADemon has joined #archiveteam-bs
[08:22] *** RichardG has joined #archiveteam-bs
[08:25] *** Honno has quit IRC (Ping timeout: 1208 seconds)
[08:26] *** RichardG has quit IRC (Read error: Connection reset by peer)
[08:26] *** RichardG has joined #archiveteam-bs
[08:31] *** RichardG has quit IRC (Ping timeout: 244 seconds)
[08:45] *** RichardG has joined #archiveteam-bs
[08:50] *** RichardG_ has joined #archiveteam-bs
[08:54] *** RichardG has quit IRC (Ping timeout: 370 seconds)
[09:05] *** RichardG has joined #archiveteam-bs
[09:08] *** RichardG_ has quit IRC (Ping timeout: 370 seconds)
[09:13] *** RichardG has quit IRC (Ping timeout: 370 seconds)
[09:17] *** RichardG has joined #archiveteam-bs
[09:25] *** RichardG_ has joined #archiveteam-bs
[09:27] *** RichardG has quit IRC (Ping timeout: 370 seconds)
[09:30] *** RichardG_ has quit IRC (Ping timeout: 250 seconds)
[09:47] *** RichardG has joined #archiveteam-bs
[09:51] *** RichardG has quit IRC (Ping timeout: 250 seconds)
[10:01] *** RichardG has joined #archiveteam-bs
[10:05] *** RichardG has quit IRC (Ping timeout: 258 seconds)
[10:21] *** Honno has joined
#archiveteam-bs
[10:28] *** Sanqui is now known as sanquiAFK
[10:45] *** Honno has quit IRC (Ping timeout: 1208 seconds)
[11:01] *** r3c0d3x has quit IRC (Ping timeout: 260 seconds)
[11:07] *** r3c0d3x has joined #archiveteam-bs
[11:12] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[11:12] *** BartoCH has joined #archiveteam-bs
[11:28] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[11:31] *** RichardG has joined #archiveteam-bs
[11:41] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[11:59] *** BartoCH has joined #archiveteam-bs
[12:24] *** Coderjoe has quit IRC (Read error: Operation timed out)
[12:27] *** Coderjoe has joined #archiveteam-bs
[12:34] *** Silvan has joined #archiveteam-bs
[12:34] *** SilSte has quit IRC (Read error: Connection reset by peer)
[12:34] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:44] *** metalcamp has joined #archiveteam-bs
[13:02] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:02] *** BartoCH has joined #archiveteam-bs
[13:29] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:36] *** BartoCH has joined #archiveteam-bs
[14:07] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[14:18] *** BartoCH has joined #archiveteam-bs
[14:55] *** nightpool has joined #archiveteam-bs
[14:59] *** nightpool has quit IRC (Ping timeout: 260 seconds)
[15:39] *** DoomTay has joined #archiveteam-bs
[15:54] *** Coderjoe has quit IRC (Read error: Operation timed out)
[15:56] *** Coderjoe has joined #archiveteam-bs
[15:59] *** Honno has joined #archiveteam-bs
[15:59] *** JesseW has joined #archiveteam-bs
[16:02] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[16:17] *** nightpool has joined #archiveteam-bs
[16:19] *** BartoCH has joined #archiveteam-bs
[16:25] *** DoomTay has quit IRC (Quit: Page closed)
[16:29] *** DoomTay has joined #archiveteam-bs
[16:29] *** Honno has quit IRC (Ping timeout: 1208 seconds)
[16:36] *** DoomTay has quit IRC (Quit: Page closed)
[16:46] it might be a good idea to look at SoundCloud
again... site has put itself up for sale. http://www.digitalmusicnews.com/2016/07/27/soundcloud-1-billion-sale-service/
[16:52] :-(
[16:59] *** ndiddy has joined #archiveteam-bs
[17:07] *** Start_ is now known as Start
[17:41] *** useretail has quit IRC (Remote host closed the connection)
[18:01] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[18:08] *** BartoCH has joined #archiveteam-bs
[18:20] I gotta get codearchive to whitelist https://github.com/chr15m/drillbit/
[18:20] this code is doooooope
[18:27] HCross: yes, but let's first get a yahoo project running
[18:39] *** nightpool has quit IRC (Read error: Operation timed out)
[18:53] *** useretail has joined #archiveteam-bs
[19:02] *** nightpool has joined #archiveteam-bs
[19:17] *** schbirid has joined #archiveteam-bs
[19:30] *** Coderjoe has quit IRC (Read error: Operation timed out)
[19:39] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[19:41] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:42] *** tomwsmf has joined #archiveteam-bs
[19:48] *** Coderjoe has joined #archiveteam-bs
[19:49] *** nightpool has quit IRC (Read error: Operation timed out)
[19:56] *** BartoCH has joined #archiveteam-bs
[20:11] *** anjacks0n has joined #archiveteam-bs
[20:15] *** anjacks0n has quit IRC (Ping timeout: 190 seconds)
[20:32] *** nightpool has joined #archiveteam-bs
[20:34] yipdw: it needs 10 stars before it would be considered for archiving
[20:34] yes, I know
[20:34] they also have a whitelist
[20:48] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[20:55] *** robink has quit IRC (Ping timeout: 260 seconds)
[21:20] *** Asparagir has joined #archiveteam-bs
[21:28] *** robink has joined #archiveteam-bs
[21:39] *** VADemon has quit IRC (Quit: left4dead)
[21:40] *** JesseW has joined #archiveteam-bs
[21:40] *** schbirid has quit IRC (Quit: Leaving)
[22:25] http://www.roaming-initiative.com/blog/posts/wtfm -- this is an awesome way to deal with questions, and only just now heard about it
[22:26]
*** Asparag-1 has joined #archiveteam-bs
[22:27] *** Asparag-1 has left
[22:36] *** Coderjoe has quit IRC (Read error: Operation timed out)
[22:54] *** Coderjoe has joined #archiveteam-bs
[23:08] *** BlueMaxim has joined #archiveteam-bs
[23:08] *** REiN^ has quit IRC (Ping timeout: 260 seconds)