Time |
Nickname |
Message |
00:01
🔗
|
hook54321 |
Would grab-site run on a Raspberry Pi? :P |
00:01
🔗
|
MrRadar |
Again, I don't see why not. Though it might not be super-fast |
00:02
🔗
|
MrRadar |
I switched to running wpull on an old Core 2 Duo laptop after about a week since every time I started it it would take 15+ seconds to actually start doing work |
00:02
🔗
|
MrRadar |
On the Pi |
00:03
🔗
|
MrRadar |
And it would noticeably pause to parse large web pages |
00:03
🔗
|
MrRadar |
Or even not so large pages |
00:03
🔗
|
MrRadar |
Even an old Atom like that is probably multiple times faster than the 1st Raspberry Pi CPU |
00:03
🔗
|
MrRadar |
Which is approximately equivalent to a Pentium 2 |
00:04
🔗
|
hook54321 |
The computer with the Atom processor only has a 160 GB hard drive, I'm pretty sure that will become an issue. |
00:04
🔗
|
MrRadar |
It depends on how big the site is that you're trying to scrape |
00:05
🔗
|
MrRadar |
I think grab-site instructs wpull to split the WARC files at a certain size |
00:05
🔗
|
|
whydomain has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
00:05
🔗
|
MrRadar |
So you could copy them off to an external HD as each WARC finishes |
00:06
🔗
|
MrRadar |
Hmm... looking at the source it actually looks like it doesn't by default |
00:07
🔗
|
MrRadar |
Wait, no, I'm wrong about being wrong |
00:07
🔗
|
MrRadar |
A few lines down it adds that option |
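The idea above of copying each WARC off to an external drive as it finishes could be sketched roughly as follows. This is only a sketch under stated assumptions: the `*.warc.gz` naming, the directory layout, and the rule "the most recently modified file is still being written" are all hypothetical and are not taken from grab-site or wpull.

```python
import os
import shutil
import tempfile
from pathlib import Path


def move_finished_warcs(src, dest):
    """Move every WARC except the newest one from src to dest.

    Assumption (not from grab-site/wpull): the most recently modified
    *.warc.gz file is the one still being written, so it is left alone.
    """
    src, dest = Path(src), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Oldest first, so the last element is the file presumed to be in progress.
    warcs = sorted(src.glob("*.warc.gz"), key=lambda p: p.stat().st_mtime)
    moved = []
    for warc in warcs[:-1]:
        shutil.move(str(warc), str(dest / warc.name))
        moved.append(warc.name)
    return moved


# Demo with empty placeholder files; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    crawl = Path(tmp) / "crawl"
    external = Path(tmp) / "external"
    crawl.mkdir()
    names = ["site-00000.warc.gz", "site-00001.warc.gz", "site-00002.warc.gz"]
    for i, name in enumerate(names):
        f = crawl / name
        f.write_bytes(b"")
        os.utime(f, (1000 + i, 1000 + i))  # force a known modification order
    moved = move_finished_warcs(crawl, external)
    remaining = sorted(p.name for p in crawl.iterdir())

print(moved, remaining)
```

Run periodically (for example from cron), something like this would keep a small disk from filling up, provided the crawl really does split its output into multiple files.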
00:09
🔗
|
Frogging |
hook54321: it's a bad idea given the flash memory |
00:09
🔗
|
Frogging |
(SD card) |
00:09
🔗
|
MrRadar |
To run it on a Pi? Yeah. I/O performance sucks big time on those, on top of the CPU performance issues |
00:09
🔗
|
Frogging |
constant downloading, uploading, and erasing |
00:11
🔗
|
hook54321 |
Has anyone tried to run archivebot or grab-site on Kali Linux? |
00:12
🔗
|
MrRadar |
Looking it up I see it's Debian-based so you shouldn't have any major trouble |
00:13
🔗
|
hook54321 |
The site I'm archiving is my high school's website that they use for teacher websites and homework. But it requires a login which is why I'm doing it myself. |
00:14
🔗
|
JesseW |
hook54321: that's a really good idea. Thank you for doing that. |
00:16
🔗
|
hook54321 |
It's probably better than my other idea, which is to download everything off of the shared network drive. |
00:17
🔗
|
JesseW |
Both? Both sounds good. |
00:18
🔗
|
DoomTay |
I dunno. Copying from the shared drive sounds easier, assuming you can get to it from wherever you are |
00:19
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
00:23
🔗
|
hook54321 |
The website and the drive contain completely different content. There is a computer login that I can use that isn't associated with any student. There is also a web based remote desktop that I could use, but the guest login doesn't work on that. |
00:24
🔗
|
DoomTay |
Ah |
00:25
🔗
|
DoomTay |
I just hope the staff is cool with what you're doing, lest you risk trampling the site or breaking certain terms |
00:25
🔗
|
hook54321 |
Would they notice if someone copied everything from the network drive onto an external drive? |
00:26
🔗
|
DoomTay |
I have no idea |
00:26
🔗
|
|
mismatch has quit IRC (Ping timeout: 501 seconds) |
00:27
🔗
|
MrRadar |
It probably depends on how much audit logging they're doing and whether they bother to look at them |
00:32
🔗
|
JesseW |
and how fast you do it |
00:33
🔗
|
JesseW |
if, say, you copied one file per hour (average, with a random delay), I'm pretty certain they wouldn't notice or care |
00:57
🔗
|
|
metalcamp has quit IRC (Ping timeout: 501 seconds) |
01:10
🔗
|
ranma |
is donald trump's twitter being backed up actively? |
01:11
🔗
|
ranma |
not for mentally stimulating reasons |
01:14
🔗
|
hook54321 |
By actively do you mean that whenever he posts something it's automatically backed up? Or like a weekly backup? |
01:15
🔗
|
|
nightpool has quit IRC (Read error: Operation timed out) |
01:35
🔗
|
|
Start_ has joined #archiveteam-bs |
01:35
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
01:37
🔗
|
|
ItsYoda has quit IRC (Ping timeout: 260 seconds) |
01:41
🔗
|
|
ItsYoda has joined #archiveteam-bs |
01:52
🔗
|
ranma |
automatically |
01:52
🔗
|
ranma |
whenever |
01:53
🔗
|
ranma |
i wonder if twitter just "hides" deleted posts |
01:53
🔗
|
ranma |
a la IA |
01:54
🔗
|
MrRadar |
It looks like the IA has been capturing it multiple times per day for months now: https://web.archive.org/web/*/https://twitter.com/realdonaldtrump/ |
01:54
🔗
|
ranma |
nice |
01:54
🔗
|
* |
ranma chuckles |
02:07
🔗
|
|
nightpool has joined #archiveteam-bs |
02:21
🔗
|
|
ndiddy has quit IRC (Leaving) |
02:22
🔗
|
|
ndiddy has joined #archiveteam-bs |
02:23
🔗
|
MrRadar |
Yay, Maker.tv finally fixed the blip.tv domain to point to a working server so the IA is no longer blocking it due to robots.txt issues |
02:25
🔗
|
JesseW |
excellent! |
02:26
🔗
|
JesseW |
are you sure IA didn't just change their mind about what a non-working server means? |
02:27
🔗
|
MrRadar |
Hmm... maybe they did. If you go to http://blip.tv it redirects to http://maker.tv |
02:27
🔗
|
MrRadar |
But if you go directly to http://blip.tv/robots.txt you still get a CloudFlare error page |
02:28
🔗
|
MrRadar |
Either way the IA is no longer blocking access to blip, so I'm happy |
02:29
🔗
|
DoomTay |
Clooooudflaaaaare |
03:12
🔗
|
hook54321 |
They've been capturing Hillary's feed significantly less than Trump's... Kinda concerns me, but I hate both of them in different ways but essentially equally. |
03:12
🔗
|
hook54321 |
https://web.archive.org/web/*/https://twitter.com/hillaryclinton/ |
03:15
🔗
|
xmc |
"they" |
03:20
🔗
|
hook54321 |
"they"? |
03:21
🔗
|
xmc |
who are "they" |
03:23
🔗
|
hook54321 |
It could either be the internet archive's algorithm for what gets archived and how often or people manually archiving it through the save now button or both. |
03:28
🔗
|
JesseW |
(pardon the politics, but) I can't speak to your personal emotions, but I very strongly disagree with the implied idea that Hillary and Trump's *ACTIONS*, if elected, would be similar. Hillary is vastly less likely than Trump to radically undermine the stability of the country. |
03:33
🔗
|
yipdw |
JesseW: there's things I want to do on archivebot but I have other, higher priorities |
03:34
🔗
|
yipdw |
I recommend people use grab-site because (a) it is local, (b) with high occurrence, people who ask about archivebot are trying to back up *chan or some shit |
03:34
🔗
|
JesseW |
Ah, that makes sense. |
03:35
🔗
|
JesseW |
I thought the plan was to re-implement the features that are currently only in ArchiveBot in grab-site, then switch over. |
03:37
🔗
|
DoomTay |
Doesn't grab-site only do a third of the job? |
03:37
🔗
|
DoomTay |
I mean, you would still have to upload the WARCs and all that |
03:37
🔗
|
yipdw |
that would be nice |
03:38
🔗
|
yipdw |
I also haven't really been that motivated to work on it, because it fulfills its original mission fine |
03:38
🔗
|
yipdw |
I am not really that interested in making it function for gazillion-page sites or real-time doxing |
03:39
🔗
|
* |
JesseW nods |
03:39
🔗
|
yipdw |
Javascript-heavy sites are a useful exception and that is something I do want to fix |
03:39
🔗
|
JesseW |
I'm not sure what you mean by "real-time doxing" (and I'm not sure I want to know)... |
03:40
🔗
|
DoomTay |
Keeping tabs on twitter feeds? |
03:40
🔗
|
yipdw |
back up a facebook page every 20 minutes, back up a twitter account continuously |
03:40
🔗
|
yipdw |
add login capability, add cookie jars |
03:40
🔗
|
yipdw |
anything to make it pry further |
03:42
🔗
|
hook54321 |
I don't think we need it to do scheduled backups like that, but we should consider trying to do whatever archive.is did to make it so they could archive facebook pages. |
03:42
🔗
|
yipdw |
see |
03:42
🔗
|
yipdw |
give a mouse a cookie |
03:43
🔗
|
hook54321 |
http://1.media.collegehumor.cvcdn.com/80/98/f45d988f4740f60eaea9776e76acbce5.jpg |
03:43
🔗
|
Frogging |
and they'll want a cookie jar |
03:43
🔗
|
JesseW |
yipdw: ah, ok, *now* I understand |
03:43
🔗
|
JesseW |
and I'm pretty sure I agree |
03:44
🔗
|
yipdw |
privacy and security is generally pretty tenuous, even if you're a nerd, and I don't think we need more Well Actually tools that take advantage of that problem in the architecture |
03:44
🔗
|
yipdw |
</soapbox> |
03:44
🔗
|
Frogging |
most nerds seem pro-privacy until it's someone else's :p |
03:45
🔗
|
hook54321 |
What problem? |
03:45
🔗
|
xmc |
pretty much |
03:45
🔗
|
yipdw |
yes, you get that a lot in here |
03:45
🔗
|
hook54321 |
Frogging: pretty much. you could also say archivists (archivers?) |
03:46
🔗
|
yipdw |
hook54321: the problem is that the web is assumed public-by-default and is built to encourage that. I think this is awesome for certain sorts of knowledge (e.g. scientific) but we have gone waaaaay beyond that and even to this date access controls are a mess except to that small population that kinda sorta knows what's going on |
03:47
🔗
|
hook54321 |
what's an example of us going way beyond that? |
03:47
🔗
|
yipdw |
you don't need much to do an access escalation. for example a cookie jar gives you the ability to poke into friends-only Facebook posts and expose them to a larger, not-friends-only audience |
03:48
🔗
|
hook54321 |
it would be a dummy account, no friends or any sort of connections. |
03:48
🔗
|
yipdw |
"we" meaning the sorts of things the Web is now used for |
03:48
🔗
|
yipdw |
I don't trust people to use dummy accounts and I don't trust websites, especially dying websites, to really give a shit |
03:49
🔗
|
hook54321 |
ah. what did you mean by access controls? |
03:49
🔗
|
xmc |
privacy settings |
03:49
🔗
|
yipdw |
friend requests |
03:49
🔗
|
yipdw |
these are access controls |
03:49
🔗
|
yipdw |
we have on occasion breached them (e.g. Friendster) |
03:49
🔗
|
Frogging |
hook54321: for projects that need these things, we can use them. but there's no reason to implement it in something as general-purpose as archivebot |
03:51
🔗
|
yipdw |
anyway this is sort of a long way of saying that yeah there is stuff to do on archivebot but I shelved it, and if someone else would like to dig in, I do have a history of accepting PRs |
03:51
🔗
|
hook54321 |
But isn't it kinda a necessity at this point, because we can't really successfully archive Donald Trump's or Hillary Clinton's Facebook page right now. :P |
03:51
🔗
|
yipdw |
so get their public websites |
03:51
🔗
|
yipdw |
good enough |
03:51
🔗
|
Frogging |
aren't there pages public? |
03:51
🔗
|
Frogging |
their* |
03:51
🔗
|
yipdw |
if they're public even better |
03:52
🔗
|
Frogging |
if they are then I don't get what the issue is |
03:52
🔗
|
hook54321 |
The Facebook pages are public but Facebook hates us and throws captchas at us. |
03:52
🔗
|
Frogging |
oh |
03:53
🔗
|
hook54321 |
Which archive.is has somehow gotten around and however they did it involves a dummy account. |
03:53
🔗
|
yipdw |
I'd recommend using archive.is for now, or using grab-site on your own and copying in a Facebook cookie |
03:54
🔗
|
yipdw |
(at this stage you'd be absolutely right to point out that we have the tools to do everything I said I don't want to encourage. but there's a difference between cookie-jar manipulation and --cookies=) |
03:56
🔗
|
hook54321 |
I'm afraid I don't quite understand the cookie-jar lingo here. I know what cookies are though. (In relation to computers and web browsers) |
03:56
🔗
|
hook54321 |
Another issue is that IA might not appreciate the idea of a Facebook dummy account. |
03:56
🔗
|
yipdw |
yeah |
03:57
🔗
|
yipdw |
a cookie jar is a collection of cookies; you can view them in your browser |
03:57
🔗
|
yipdw |
some tools, like curl, let you set the cookies manually |
04:00
🔗
|
yipdw |
wpull lets you do this also; it has a --load-cookies=FILE argument that loads cookies from a cookies.txt file using the Netscape/Mozilla format |
04:00
🔗
|
yipdw |
Firefox doesn't use cookies.txt anymore but there's tools like https://addons.mozilla.org/en-US/firefox/addon/cookie-exporter/ that will generate cookies.txt-format files |
04:00
🔗
|
yipdw |
so that's what I mean by cookie-jar manipulation: exporting/modifying that cookies file |
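For anyone unfamiliar with the cookies.txt format mentioned above, here is a minimal sketch of reading a Netscape/Mozilla-format file with Python's standard `http.cookiejar.MozillaCookieJar`. The domain, cookie name, and value are made up for illustration, and this shows only the file format, not how wpull itself loads the file.

```python
import http.cookiejar
import os
import tempfile

# One cookie in Netscape/Mozilla cookies.txt format. Fields are tab-separated:
# domain, include-subdomains flag, path, secure flag, expiry (Unix time),
# name, value. All values here are hypothetical.
SAMPLE = (
    "# Netscape HTTP Cookie File\n"
    "example.com\tFALSE\t/\tFALSE\t4102444800\tsession_id\tabc123\n"
)


def load_cookies(path):
    """Return {name: value} for the cookies stored in a cookies.txt file."""
    jar = http.cookiejar.MozillaCookieJar()
    # By default load() skips expired and session-only cookies.
    jar.load(path)
    return {cookie.name: cookie.value for cookie in jar}


with tempfile.TemporaryDirectory() as tmp:
    cookie_path = os.path.join(tmp, "cookies.txt")
    with open(cookie_path, "w") as fh:
        fh.write(SAMPLE)
    cookies = load_cookies(cookie_path)

print(cookies)
```

A file exported from a browser has the same shape, just with more lines, so the same loader can be used to check what an export actually contains before handing it to a crawler.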
04:02
🔗
|
DoomTay |
You can also use "document.cookies" in the javascript console |
04:02
🔗
|
DoomTay |
*document.cookie |
04:05
🔗
|
|
ndiddy has quit IRC (Read error: Connection reset by peer) |
04:07
🔗
|
hook54321 |
I'm confused about this: https://voat.co/v/MeanwhileOnReddit/comments/340988 |
04:08
🔗
|
hook54321 |
Quote from the end of the post: "TL;DR The guy running archive.is is a delusional conspiracy nut, who blocks access to his site randomly, including entire countries (in this case Finland)." |
04:13
🔗
|
JesseW |
"We are not archive.is" |
04:26
🔗
|
hook54321 |
:P |
04:40
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
04:44
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
04:44
🔗
|
|
RichardG has joined #archiveteam-bs |
04:47
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:54
🔗
|
|
tomwsmf has quit IRC (Read error: Operation timed out) |
05:08
🔗
|
|
DoomTay has joined #archiveteam-bs |
05:12
🔗
|
|
RichardG has quit IRC (Ping timeout: 633 seconds) |
05:33
🔗
|
|
RichardG has joined #archiveteam-bs |
05:54
🔗
|
|
VADemon has joined #archiveteam-bs |
06:07
🔗
|
|
RichardG has quit IRC (Ping timeout: 633 seconds) |
06:29
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
06:33
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
07:24
🔗
|
|
nightpool has quit IRC (Read error: Operation timed out) |
08:04
🔗
|
|
Honno has joined #archiveteam-bs |
08:12
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
08:17
🔗
|
|
VADemon has joined #archiveteam-bs |
08:22
🔗
|
|
RichardG has joined #archiveteam-bs |
08:25
🔗
|
|
Honno has quit IRC (Ping timeout: 1208 seconds) |
08:26
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
08:26
🔗
|
|
RichardG has joined #archiveteam-bs |
08:31
🔗
|
|
RichardG has quit IRC (Ping timeout: 244 seconds) |
08:45
🔗
|
|
RichardG has joined #archiveteam-bs |
08:50
🔗
|
|
RichardG_ has joined #archiveteam-bs |
08:54
🔗
|
|
RichardG has quit IRC (Ping timeout: 370 seconds) |
09:05
🔗
|
|
RichardG has joined #archiveteam-bs |
09:08
🔗
|
|
RichardG_ has quit IRC (Ping timeout: 370 seconds) |
09:13
🔗
|
|
RichardG has quit IRC (Ping timeout: 370 seconds) |
09:17
🔗
|
|
RichardG has joined #archiveteam-bs |
09:25
🔗
|
|
RichardG_ has joined #archiveteam-bs |
09:27
🔗
|
|
RichardG has quit IRC (Ping timeout: 370 seconds) |
09:30
🔗
|
|
RichardG_ has quit IRC (Ping timeout: 250 seconds) |
09:47
🔗
|
|
RichardG has joined #archiveteam-bs |
09:51
🔗
|
|
RichardG has quit IRC (Ping timeout: 250 seconds) |
10:01
🔗
|
|
RichardG has joined #archiveteam-bs |
10:05
🔗
|
|
RichardG has quit IRC (Ping timeout: 258 seconds) |
10:21
🔗
|
|
Honno has joined #archiveteam-bs |
10:28
🔗
|
|
Sanqui is now known as sanquiAFK |
10:45
🔗
|
|
Honno has quit IRC (Ping timeout: 1208 seconds) |
11:01
🔗
|
|
r3c0d3x has quit IRC (Ping timeout: 260 seconds) |
11:07
🔗
|
|
r3c0d3x has joined #archiveteam-bs |
11:12
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
11:12
🔗
|
|
BartoCH has joined #archiveteam-bs |
11:28
🔗
|
|
GLaDOS has quit IRC (Ping timeout: 260 seconds) |
11:31
🔗
|
|
RichardG has joined #archiveteam-bs |
11:41
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
11:59
🔗
|
|
BartoCH has joined #archiveteam-bs |
12:24
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
12:27
🔗
|
|
Coderjoe has joined #archiveteam-bs |
12:34
🔗
|
|
Silvan has joined #archiveteam-bs |
12:34
🔗
|
|
SilSte has quit IRC (Read error: Connection reset by peer) |
12:34
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:44
🔗
|
|
metalcamp has joined #archiveteam-bs |
13:02
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:02
🔗
|
|
BartoCH has joined #archiveteam-bs |
13:29
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:36
🔗
|
|
BartoCH has joined #archiveteam-bs |
14:07
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
14:18
🔗
|
|
BartoCH has joined #archiveteam-bs |
14:55
🔗
|
|
nightpool has joined #archiveteam-bs |
14:59
🔗
|
|
nightpool has quit IRC (Ping timeout: 260 seconds) |
15:39
🔗
|
|
DoomTay has joined #archiveteam-bs |
15:54
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
15:56
🔗
|
|
Coderjoe has joined #archiveteam-bs |
15:59
🔗
|
|
Honno has joined #archiveteam-bs |
15:59
🔗
|
|
JesseW has joined #archiveteam-bs |
16:02
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
16:17
🔗
|
|
nightpool has joined #archiveteam-bs |
16:19
🔗
|
|
BartoCH has joined #archiveteam-bs |
16:25
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
16:29
🔗
|
|
DoomTay has joined #archiveteam-bs |
16:29
🔗
|
|
Honno has quit IRC (Ping timeout: 1208 seconds) |
16:36
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
16:46
🔗
|
HCross |
it might be a good idea to look at SoundCloud again... site has put itself up for sale. http://www.digitalmusicnews.com/2016/07/27/soundcloud-1-billion-sale-service/ |
16:52
🔗
|
JesseW |
:-( |
16:59
🔗
|
|
ndiddy has joined #archiveteam-bs |
17:07
🔗
|
|
Start_ is now known as Start |
17:41
🔗
|
|
useretail has quit IRC (Remote host closed the connection) |
18:01
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
18:08
🔗
|
|
BartoCH has joined #archiveteam-bs |
18:20
🔗
|
yipdw |
I gotta get codearchive to whitelist https://github.com/chr15m/drillbit/ |
18:20
🔗
|
yipdw |
this code is doooooope |
18:27
🔗
|
arkiver |
HCross: yes, but let's first get a yahoo project running |
18:39
🔗
|
|
nightpool has quit IRC (Read error: Operation timed out) |
18:53
🔗
|
|
useretail has joined #archiveteam-bs |
19:02
🔗
|
|
nightpool has joined #archiveteam-bs |
19:17
🔗
|
|
schbirid has joined #archiveteam-bs |
19:30
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
19:39
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
19:41
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
19:42
🔗
|
|
tomwsmf has joined #archiveteam-bs |
19:48
🔗
|
|
Coderjoe has joined #archiveteam-bs |
19:49
🔗
|
|
nightpool has quit IRC (Read error: Operation timed out) |
19:56
🔗
|
|
BartoCH has joined #archiveteam-bs |
20:11
🔗
|
|
anjacks0n has joined #archiveteam-bs |
20:15
🔗
|
|
anjacks0n has quit IRC (Ping timeout: 190 seconds) |
20:32
🔗
|
|
nightpool has joined #archiveteam-bs |
20:34
🔗
|
dashcloud |
yipdw: it needs 10 stars before it would be considered for archiving |
20:34
🔗
|
yipdw |
yes, I know |
20:34
🔗
|
yipdw |
they also have a whitelist |
20:48
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
20:55
🔗
|
|
robink has quit IRC (Ping timeout: 260 seconds) |
21:20
🔗
|
|
Asparagir has joined #archiveteam-bs |
21:28
🔗
|
|
robink has joined #archiveteam-bs |
21:39
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
21:40
🔗
|
|
JesseW has joined #archiveteam-bs |
21:40
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
22:25
🔗
|
JesseW |
http://www.roaming-initiative.com/blog/posts/wtfm -- this is an awesome way to deal with questions, and I only just now heard about it |
22:26
🔗
|
|
Asparag-1 has joined #archiveteam-bs |
22:27
🔗
|
|
Asparag-1 has left |
22:36
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
22:54
🔗
|
|
Coderjoe has joined #archiveteam-bs |
23:08
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
23:08
🔗
|
|
REiN^ has quit IRC (Ping timeout: 260 seconds) |