Time |
Nickname |
Message |
00:08
🔗
|
i336_ |
the ex.ua project just got started in the tracker, everyone, so you might like to start running it. it's currently at 0 |
00:31
🔗
|
|
brayden has joined #archiveteam-bs |
00:31
🔗
|
|
swebb sets mode: +o brayden |
01:07
🔗
|
godane |
can anyone figure out why this domain is going slow: http://archive1.rthk.hk/mp4/tv/2013/201303261900_h.mp4 |
01:12
🔗
|
HCross2 |
It's in Hong Kong, a long way from you in the states |
01:26
🔗
|
godane |
i may be able to get a few shows from their english site |
01:27
🔗
|
godane |
but still is all hosted in Hong Kong |
01:27
🔗
|
godane |
good news is ffmpeg downloads the streams faster than wget |
01:41
🔗
|
i336_ |
godane: maybe wget's UA is ratelimited |
01:42
🔗
|
i336_ |
also, just to let everyone know, ex.ua is up, but not many are running it - check your warriors! |
02:05
🔗
|
i336_ |
wat |
02:05
🔗
|
i336_ |
my warrior is sitting there doing nothing |
02:05
🔗
|
i336_ |
what do I do now? |
02:05
🔗
|
i336_ |
can I ^C and restart it? |
02:08
🔗
|
yipdw |
is it really doing nothing, or is it working |
02:08
🔗
|
yipdw |
you have 23 open claims |
02:08
🔗
|
i336_ |
the last lines are: |
02:08
🔗
|
i336_ |
Finished WgetDownload for Item filelists:58570500-58570999 |
02:08
🔗
|
i336_ |
Starting PrepareStatsForTracker for Item filelists:58570500-58570999 |
02:08
🔗
|
i336_ |
Finished PrepareStatsForTracker for Item filelists:58570500-58570999 |
02:08
🔗
|
i336_ |
Starting MoveFiles for Item filelists:58570500-58570999 |
02:08
🔗
|
i336_ |
Finished MoveFiles for Item filelists:58570500-58570999 |
02:08
🔗
|
i336_ |
(sorry) |
02:08
🔗
|
i336_ |
but yeah. it's stuck. sitting there. |
02:10
🔗
|
yipdw |
please use pastebins or similar for that sort of stuff |
02:10
🔗
|
yipdw |
anyway, the next step after MoveFiles is a concurrency-limited rsync upload |
02:10
🔗
|
i336_ |
oh... |
02:10
🔗
|
i336_ |
and sorry |
02:10
🔗
|
yipdw |
the default is 1 rsync uploader per process |
02:11
🔗
|
i336_ |
I did --concurrency 20 |
02:11
🔗
|
yipdw |
if you have 23 runners in the same process, 22 are going to wait |
02:11
🔗
|
yipdw |
there's the problem |
02:11
🔗
|
i336_ |
??? |
02:11
🔗
|
i336_ |
did I break my crawl? :( |
02:11
🔗
|
yipdw |
https://github.com/ArchiveTeam/exua-grab/blob/master/pipeline.py#L282 |
02:12
🔗
|
yipdw |
rsync processes per pipeline process is limited to 4, defaults to 1 |
02:12
🔗
|
i336_ |
hmm. |
02:12
🔗
|
yipdw |
there's a few reasons why we do this |
02:12
🔗
|
yipdw |
the biggest reason is that that's been part of the code that gets copied across projects |
02:12
🔗
|
i336_ |
I see |
02:12
🔗
|
yipdw |
the other reasons include connections typically being asymmetric and limited number of connections per rsync host |
02:13
🔗
|
i336_ |
mmmm, right |
02:13
🔗
|
i336_ |
well, I may have a bigger problem: iftop on the machine in question is currently only talking to 192.168.x.x and my IP address. |
02:13
🔗
|
yipdw |
in any case, your --concurrent 20 is going to block 19 workers at the rsync stage |
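The bottleneck yipdw describes can be reproduced in miniature. This is only a sketch under stated assumptions: it uses plain Python threads and a semaphore rather than the real pipeline code, and `run_item`, `UPLOAD_SLOTS`, `WORKERS` and the sleep durations are all made up for illustration.

```python
import threading
import time

UPLOAD_SLOTS = 1   # assumption: one uploader per process, the default mentioned above
WORKERS = 20       # assumption: what a 20-worker run would start

upload_gate = threading.Semaphore(UPLOAD_SLOTS)
counter_lock = threading.Lock()
uploading = 0
peak_uploading = 0


def run_item(item_id):
    """Hypothetical worker: 'download' freely, then queue for the upload slot."""
    global uploading, peak_uploading
    time.sleep(0.01)                  # stand-in for the download stage
    with upload_gate:                 # every worker lines up here
        with counter_lock:
            uploading += 1
            peak_uploading = max(peak_uploading, uploading)
        time.sleep(0.01)              # stand-in for the rsync upload
        with counter_lock:
            uploading -= 1


def run_all():
    """Start all workers, wait for them, and report the peak number of uploads."""
    threads = [threading.Thread(target=run_item, args=(i,)) for i in range(WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak_uploading


if __name__ == "__main__":
    print("peak simultaneous uploads:", run_all())
```

However many workers are started, only `UPLOAD_SLOTS` of them are ever inside the upload step at once; the rest sit idle at the gate, which is why a high worker count buys little here.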
02:13
🔗
|
i336_ |
showing the machine talking* |
02:13
🔗
|
i336_ |
oh yikes |
02:13
🔗
|
i336_ |
if I ^C it, will it restart from scratch? |
02:13
🔗
|
yipdw |
no, the claims you have out will remain out until someone recycles them |
02:14
🔗
|
yipdw |
restarting the process will restart the pipeline from the fetch stage |
02:14
🔗
|
i336_ |
I mean - will the data I've downloaded be found and sent? |
02:14
🔗
|
i336_ |
ah. |
02:14
🔗
|
yipdw |
no, that data will only be sent as part of the rsync stage |
02:14
🔗
|
yipdw |
typically it doesn't matter |
02:14
🔗
|
i336_ |
okay. so should I ^C it then? how many workers should I run with/ |
02:14
🔗
|
i336_ |
? |
02:14
🔗
|
yipdw |
4 |
02:14
🔗
|
yipdw |
or 2 |
02:15
🔗
|
* |
i336_ restarts it with 3 :P |
02:15
🔗
|
yipdw |
the warrior VM has a limit of 6 |
02:15
🔗
|
i336_ |
I see |
02:15
🔗
|
* |
i336_ uses 4 then |
02:15
🔗
|
|
jrwr has joined #archiveteam-bs |
02:15
🔗
|
i336_ |
....it's now saying "stopping when current tasks are completed" and waiting. |
02:15
🔗
|
i336_ |
I'm curious what it's waiting for. |
02:15
🔗
|
yipdw |
waiting for tasks to complete |
02:16
🔗
|
i336_ |
yeah - but... what? |
02:16
🔗
|
yipdw |
a task is one full trip through the pipeline |
02:16
🔗
|
yipdw |
i.e. rsync upload |
02:16
🔗
|
i336_ |
rsync isn't running _at all_ |
02:16
🔗
|
i336_ |
and it is installed, fwiw |
02:16
🔗
|
|
RichardG_ is now known as RichardG |
02:17
🔗
|
i336_ |
checking htop, run-pipeline has no child processes running underneath it. |
02:17
🔗
|
yipdw |
interrupt again to force-quit the process and this time run with a more sensible number of workers |
02:17
🔗
|
yipdw |
like 2 |
02:17
🔗
|
yipdw |
if there is a problem spawning rsync, that'll make it much easier to diagnose |
02:17
🔗
|
i336_ |
okay :( |
02:17
🔗
|
i336_ |
done |
02:20
🔗
|
i336_ |
argh. "out" just went from 23 to 3 :< |
02:20
🔗
|
yipdw |
it's fine |
02:20
🔗
|
yipdw |
I requeued them |
02:20
🔗
|
i336_ |
yeah |
02:20
🔗
|
i336_ |
and I just figured it out |
02:20
🔗
|
i336_ |
I started 20 processes |
02:20
🔗
|
i336_ |
the ratelimiter was pausing them |
02:20
🔗
|
i336_ |
so they were literally waiting to run |
02:20
🔗
|
i336_ |
right? |
02:21
🔗
|
yipdw |
they may have gotten to some stage of completion |
02:21
🔗
|
yipdw |
but they're not going to be counted as done until the work item makes it through the pipeline and is checked in |
02:21
🔗
|
i336_ |
yup. |
02:22
🔗
|
i336_ |
I saw a lot of messages about "we don't want to overload this resource so we're waiting" at the start of the run |
02:22
🔗
|
i336_ |
heh |
02:22
🔗
|
yipdw |
that's a tracker-side rate limit |
02:22
🔗
|
i336_ |
oh okay |
02:22
🔗
|
i336_ |
I'm not sure then. |
02:22
🔗
|
i336_ |
hopefully this works |
02:23
🔗
|
nicolas17 |
is there any point in adding more warrior nodes once the tracker rate limiter is already being hit? |
02:23
🔗
|
arkiver |
the limit is currently at 2 items per minute |
02:23
🔗
|
arkiver |
I'll raise it if the site can handle it |
02:23
🔗
|
yipdw |
nicolas17: for a given project, not really |
02:24
🔗
|
yipdw |
they might be better utilized on some other warrior project |
02:24
🔗
|
arkiver |
for this project it wouldn't hurt, since we don't really know where the limit is. Just keeping it low for now to see if the site can handle it |
02:25
🔗
|
i336_ |
I see |
02:25
🔗
|
arkiver |
and will raise it as long as the site remains stable (since we also have only 20 days) |
02:26
🔗
|
i336_ |
mmmm |
02:32
🔗
|
yipdw |
exua-grab so far is proceeding normally here |
02:32
🔗
|
i336_ |
ok. it just got to where it stalled before |
02:33
🔗
|
i336_ |
yipdw: htop is showing nothing running underneath python again |
02:33
🔗
|
i336_ |
yipdw: note that this is running the crawler directly from git, on freebsd |
02:34
🔗
|
i336_ |
a) what can I look for/at? where's the debug/status info? what can I inspect for sanity? b) you can SSH in if you want |
02:35
🔗
|
yipdw |
we don't really run this code on FreeBSD that often |
02:35
🔗
|
i336_ |
I realize that - but my friend's PC with the ZFS pool is running freebsd, so I'm trying to use it |
02:35
🔗
|
i336_ |
Kaz didn't mention if I could use his VPS for this so I haven't |
02:36
🔗
|
joepie91 |
hey, a FreeBSD user |
02:36
🔗
|
joepie91 |
:P |
02:36
🔗
|
yipdw |
a warrior client doesn't need a ZFS pool |
02:36
🔗
|
i336_ |
in this context I mean "pile of diskspace" |
02:36
🔗
|
yipdw |
I know what a ZFS pool is |
02:36
🔗
|
i336_ |
right |
02:36
🔗
|
yipdw |
it's still not really needed for a warrior |
02:36
🔗
|
joepie91 |
warriors don't usually need a lot of disk space fwiw |
02:36
🔗
|
i336_ |
yeah |
02:37
🔗
|
nicolas17 |
warriors download, upload, delete |
02:37
🔗
|
yipdw |
I have a FreeBSD system here, I'll try to debug |
02:37
🔗
|
i336_ |
unfortunately, at this point I also mean "PC with bandwidth", my own internet is 50GB/mo and HTML5(TM) uses most of that sadly (yup) |
02:37
🔗
|
i336_ |
yipdw: wget-lua was fun to build, but I got it working |
02:37
🔗
|
nicolas17 |
i336_: hope you use an adblocker |
02:37
🔗
|
yipdw |
I have exua-grab uploading filelists:82940500-82940999 |
02:37
🔗
|
i336_ |
nicolas17: yup, /etc/hosts file |
02:37
🔗
|
yipdw |
and done |
02:37
🔗
|
joepie91 |
i336_: I'm guessing you ran into this? https://github.com/joepie91/isohunt-grab#for-freebsd |
02:38
🔗
|
* |
nicolas17 gets 50MB/day on his phone |
02:38
🔗
|
yipdw |
ok, so we know the grab works on Ubuntu |
02:38
🔗
|
i336_ |
nicolas17: wow. |
02:38
🔗
|
i336_ |
joepie91: yup, and managed to get past it |
02:38
🔗
|
nicolas17 |
still cheaper than communicating over SMS :P |
02:38
🔗
|
i336_ |
lol, yeah |
02:38
🔗
|
joepie91 |
i336_: always fun to hear that issues from several years ago are still issues :P |
02:38
🔗
|
i336_ |
hahaha |
02:39
🔗
|
* |
i336_ swat |
02:43
🔗
|
yipdw |
I'll resume poking at this in a bit; I need to hop on a conference call |
02:43
🔗
|
i336_ |
okay. thanks! |
02:44
🔗
|
yipdw |
in the meantime, if you can run the grabber on a Linux-ish machine you may have better luck |
02:44
🔗
|
* |
nicolas17 has a bored EC2, should look into it |
02:45
🔗
|
* |
i336_ volunteers to be sysadmin |
02:45
🔗
|
i336_ |
(for exua crawling specifically :P) |
02:45
🔗
|
i336_ |
although it's not hard, tbh. |
02:57
🔗
|
|
ndiddy has quit IRC (Quit: Leaving) |
02:59
🔗
|
compu_85 |
the warrior seems to be running fine for me on this project |
03:00
🔗
|
compu_85 |
so far |
03:06
🔗
|
i336_ |
okay, </lunch> |
03:06
🔗
|
i336_ |
time to see if I can figure out what's going on |
03:06
🔗
|
i336_ |
hopefully I can |
03:06
🔗
|
i336_ |
it's still stalled |
03:14
🔗
|
|
RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) |
03:32
🔗
|
|
Fletcher has quit IRC (Ping timeout: 244 seconds) |
03:40
🔗
|
|
Fletcher has joined #archiveteam-bs |
04:01
🔗
|
|
jrwr has quit IRC (Remote host closed the connection) |
04:07
🔗
|
i336_ |
okay, this is really annoying |
04:07
🔗
|
i336_ |
I enabled the python debugger, but ex.ua has ratelimited me so it's taking hours to crash |
04:07
🔗
|
i336_ |
I feel so stupid downloading content just to make it crash >.< |
04:08
🔗
|
i336_ |
I think I need to take a break for a little while. I'm just frustrated that I don't have resources. not anyone's fault but my own, I'm just really aware that ex.ua will be gone in a few days and there's really nothing I can do about it and it makes me really sad :( |
04:45
🔗
|
SketchCow |
Hurrah |
04:46
🔗
|
SketchCow |
PurpleSym: The next round through, I'm sure that'll happen |
04:48
🔗
|
i336_ |
SketchCow: I got your email reply back - thanks so much for that. The exua archiver project is running in the tracker! arkiver's current project is to save the file references, I'm also hoping we can save the discussions on the site as well. There are a lot of access vectors ex.ua forgot to turn off :D |
04:49
🔗
|
i336_ |
SketchCow: I understand (but have no real information) that there might be some discussions on Monday regarding the actual content on the site. That will be /interesting/, I'm sure. :) |
04:56
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
05:17
🔗
|
SketchCow |
You got Arkiver's |
05:17
🔗
|
SketchCow |
Arkiver's the dude |
05:17
🔗
|
i336_ |
^^ |
05:18
🔗
|
i336_ |
okay then |
05:19
🔗
|
i336_ |
I'll reiterate my question |
05:20
🔗
|
i336_ |
I'm trying to figure out how to prototype new crawler projects locally... where should I look on the wiki? |
05:22
🔗
|
i336_ |
...I don't really want to set up a full tracker. is there no local mode? |
05:23
🔗
|
i336_ |
I'm going out now, I'm really looking forward to figuring this out, if anyone can answer while I'm gone I'd really appreciate it. I want to try and help with my own archiving code |
05:24
🔗
|
i336_ |
i336_ will disconnect in a couple minutes but i336 is still here, so I'll still see what everyone says |
05:25
🔗
|
yipdw |
there isn't a local mode; you're running a test tracker project, a test tracker, or you substitute in a mock |
05:25
🔗
|
yipdw |
the third option hasn't been implemented |
05:26
🔗
|
yipdw |
you can accelerate a local tracker setup with https://github.com/ArchiveTeam/archiveteam-dev-env |
05:26
🔗
|
yipdw |
specifically, the linked OVA |
05:26
🔗
|
yipdw |
#warrior is for discussion of these tools |
05:29
🔗
|
|
i336_ has quit IRC (Read error: Operation timed out) |
05:33
🔗
|
|
ravetcofx has joined #archiveteam-bs |
05:36
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
05:43
🔗
|
|
Sk1d has joined #archiveteam-bs |
05:51
🔗
|
|
nicolas17 has quit IRC (Quit: nuff 4 2day) |
06:07
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
06:08
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:19
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
07:24
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
07:31
🔗
|
|
Start has joined #archiveteam-bs |
07:38
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
08:02
🔗
|
|
GE has joined #archiveteam-bs |
08:48
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
09:13
🔗
|
|
ravetcofx has quit IRC (Read error: Operation timed out) |
09:23
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
10:41
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
11:06
🔗
|
|
GE has joined #archiveteam-bs |
11:21
🔗
|
arkiver |
If anyone here 'lain'? or anyone knows who that it? |
11:21
🔗
|
arkiver |
is* |
11:28
🔗
|
Sanqui |
there may be several lains. I know somebody who has used that nick in the past. |
11:28
🔗
|
Sanqui |
(several years ago, though.) |
11:54
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
12:03
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:41
🔗
|
|
Whopper has joined #archiveteam-bs |
13:15
🔗
|
|
RichardG has joined #archiveteam-bs |
13:33
🔗
|
|
RichardG has quit IRC (Ping timeout: 244 seconds) |
13:34
🔗
|
|
RichardG has joined #archiveteam-bs |
13:42
🔗
|
|
GE has joined #archiveteam-bs |
13:53
🔗
|
|
_desu___ has joined #archiveteam-bs |
13:59
🔗
|
|
Ctrl-S___ has joined #archiveteam-bs |
14:00
🔗
|
|
antonizoo has quit IRC () |
14:01
🔗
|
|
antonizoo has joined #archiveteam-bs |
15:12
🔗
|
|
hook54321 has quit IRC () |
15:12
🔗
|
|
Yoshimura has quit IRC (Ping timeout: 255 seconds) |
15:12
🔗
|
|
hook54321 has joined #archiveteam-bs |
15:53
🔗
|
|
sep332 has joined #archiveteam-bs |
16:06
🔗
|
|
Start has joined #archiveteam-bs |
16:28
🔗
|
|
Yoshimura has joined #archiveteam-bs |
17:10
🔗
|
godane |
so RTHK radio3 Hong Kong Today is starting to get uploaded: https://archive.org/search.php?query=subject%3A%22Hong+Kong+Today%22 |
17:18
🔗
|
|
t2t2 has quit IRC (Ping timeout: 633 seconds) |
17:35
🔗
|
godane |
so i may be going through RTHK TV videos |
17:36
🔗
|
godane |
it will be done sort of like how i did kpfa since the urls have the same patterns |
19:42
🔗
|
|
ravetcofx has joined #archiveteam-bs |
20:29
🔗
|
godane |
so turns out there are pdfs of The Tech Newspaper |
20:29
🔗
|
godane |
a newspaper from MIT going back to 1881 |
20:41
🔗
|
godane |
how do i get this to not be invalid date: date -d "November 16, 1881" +%Y-%m-%d |
20:41
🔗
|
xmc |
seems to work for me |
20:41
🔗
|
xmc |
$ date -d "November 16, 1881" +%Y-%m-%d |
20:41
🔗
|
xmc |
1881-11-16 |
20:43
🔗
|
godane |
i keep getting invalid date on my slackware |
20:44
🔗
|
HCross2 |
godane: does the CMOS battery in your motherboard still have power in it? |
20:44
🔗
|
HCross2 |
Do you get a bios error on startup? |
20:45
🔗
|
godane |
hwclock still works |
20:45
🔗
|
godane |
i also never noticed any bios errors on boot |
20:45
🔗
|
xmc |
that probably has nothing to do with it ... |
20:46
🔗
|
xmc |
um, is your machine 32 or 64 bit |
20:46
🔗
|
xmc |
and when was it compiled? |
20:47
🔗
|
xmc |
could be you have a 32-bit time_t |
20:47
🔗
|
HCross2 |
Ahh, I misread. Formatting dates, not telling the current one. Sorry |
20:47
🔗
|
xmc |
also, what's your timezone set to |
20:47
🔗
|
godane |
i have 64 bit system with i486 slackware |
20:47
🔗
|
godane |
i don't think my timezone is set |
20:47
🔗
|
xmc |
try "echo $TZ" |
20:48
🔗
|
godane |
its blank |
20:48
🔗
|
xmc |
you're in eastern time, right? try TZ=EST5EDT date -d "November 16, 1881" +%Y-%m-%d |
20:49
🔗
|
godane |
its still invalid |
20:49
🔗
|
xmc |
hrum |
20:49
🔗
|
xmc |
try dates in 1899, 1900, and 1901? |
20:49
🔗
|
xmc |
basically, can you figure out when it becomes valid |
20:50
🔗
|
godane |
even this doesn't work: date -d "18811116" +%Y-%m-%d |
20:51
🔗
|
xmc |
i suspect it doesn't like 1881. but it could be that it's before 1900, before timezones were invented, or more than two billion seconds before 0 unix time |
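xmc's last guess can be checked with a little arithmetic. The sketch below assumes a signed counter of seconds since 1970-01-01 UTC and works out the earliest and latest instants such a counter can hold; the helper name `time_t_range` is made up for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_t_range(bits=32):
    """Return (earliest, latest) instants a signed `bits`-wide seconds counter can hold."""
    lowest = -(2 ** (bits - 1))        # about -2.1 billion for 32 bits
    highest = 2 ** (bits - 1) - 1
    return EPOCH + timedelta(seconds=lowest), EPOCH + timedelta(seconds=highest)


earliest, latest = time_t_range(32)
print(earliest.isoformat())  # 1901-12-13T20:45:52+00:00
print(latest.isoformat())    # 2038-01-19T03:14:07+00:00
```

Anything before the first value, including every date in 1881, simply has no representation in a 32-bit counter.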
20:52
🔗
|
godane |
looks like 1981 worked fine |
20:53
🔗
|
godane |
1902 is the earliest i could get |
20:54
🔗
|
xmc |
more narrowly: try dec 12 1901 and dec 14 1901 |
20:55
🔗
|
godane |
it works on december 14 1901 but not december 12 or 13 1901 |
20:55
🔗
|
xmc |
ding ding ding |
20:56
🔗
|
xmc |
you need a system with a 64-bit time_t |
20:56
🔗
|
xmc |
i486 slackware won't have that, you need x86_64 |
20:56
🔗
|
xmc |
or you can figure out a workaround that doesn't use `date` |
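One possible workaround that avoids `date` entirely, as a minimal sketch: Python's `datetime` does not depend on the platform's `time_t` for parsing or for ISO formatting, so old dates go through. The function name `to_iso_date` is made up here, and `%B` assumes English month names (the default C locale).

```python
from datetime import datetime


def to_iso_date(text, fmt="%B %d, %Y"):
    """Parse a date like 'November 16, 1881' and return it as YYYY-MM-DD."""
    return datetime.strptime(text, fmt).date().isoformat()


print(to_iso_date("November 16, 1881"))  # 1881-11-16
```

From a shell this could be wrapped in a one-line script and called once per issue date.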
20:57
🔗
|
xmc |
computers are kind of garbage, sorry |
21:12
🔗
|
godane |
i just figure i will do the The Tech newspaper with volume and issue |
21:13
🔗
|
xmc |
and just put the date into the archive.org date-published field, that should be plenty |
21:13
🔗
|
godane |
i can't put it as November 16, 1881 format |
21:14
🔗
|
godane |
anyways i will be able to put date for issues past volume 21 |
21:14
🔗
|
* |
xmc nods |
21:22
🔗
|
|
johansch has joined #archiveteam-bs |
21:24
🔗
|
johansch |
So.. to continue my stream from #archiveteam (apologies =) ) |
21:25
🔗
|
johansch |
98 GB per day (2012) for imgur.com averages out to about 9 Mbit/s, if i did the calculations correctly |
21:27
🔗
|
johansch |
assuming they did a 1.5x growth per year, that's about 45 Mbit/s today |
21:28
🔗
|
johansch |
or about 500 GB/day |
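The estimate above can be rechecked in a few lines. This is a sketch with two assumptions that are not stated in the log: 1 GB is taken as 10^9 bytes, and "today" is taken as four years after 2012 (the value that makes 1.5x yearly growth land near the quoted 45 Mbit/s).

```python
SECONDS_PER_DAY = 86_400


def gb_per_day_to_mbit_s(gb_per_day):
    """Convert a daily volume in (decimal) GB to an average rate in Mbit/s."""
    return gb_per_day * 1e9 * 8 / SECONDS_PER_DAY / 1e6


def project(gb_per_day, yearly_growth, years):
    """Compound a daily volume forward by a fixed yearly growth factor."""
    return gb_per_day * yearly_growth ** years


rate_2012 = gb_per_day_to_mbit_s(98)   # figure quoted for 2012
gb_now = project(98, 1.5, 4)           # assumption: 4 years of 1.5x growth
rate_now = gb_per_day_to_mbit_s(gb_now)

print(round(rate_2012, 1), round(gb_now), round(rate_now, 1))  # 9.1 496 45.9
```

So the quoted 9 Mbit/s, 45 Mbit/s and 500 GB/day are all consistent with each other under those assumptions.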
21:28
🔗
|
johansch |
that seems like something archive.org could handle... |
21:29
🔗
|
johansch |
but that would of course assume there's an efficient mechanism to get new items... |
21:36
🔗
|
SketchCow |
I don't think archive.org is going to back up imgur |
21:36
🔗
|
SketchCow |
Maybe they might save the top, say, 100,000 items |
21:37
🔗
|
|
RedType has left |
21:37
🔗
|
johansch |
wow, this is a testy place.. |
21:38
🔗
|
xmc |
TRYING to keep the announcements channel clear from discussion of potential project logistics |
21:38
🔗
|
xmc |
thank you for cooperating |
21:38
🔗
|
johansch |
so maybe then rename it #archiveteam-announcements and #archiveteam-discussions ? |
21:39
🔗
|
xmc |
no |
21:39
🔗
|
johansch |
just a newb's point of view. |
21:39
🔗
|
xmc |
it's in the topic |
21:39
🔗
|
Frogging |
500GB per day is a lot, especially given that most of it is crap |
21:40
🔗
|
Frogging |
A better way would be selectively archiving popular or unique/original content as linked from various communities |
21:40
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
21:41
🔗
|
Igloo^_^ |
Have a criteria for x likes etc |
21:42
🔗
|
Igloo^_^ |
Could work, but getting access to the statistics (c/w)ould be difficult |
21:42
🔗
|
johansch |
@xmc i've go to congratulated you - way to make someone being enthusiastic about this topic feeling not-very-welcome.. |
21:42
🔗
|
Frogging |
Igloo^_^: A lot of it isn't submitted to the Imgur "gallery" with likes and all, it'd be linked from forums and subreddits |
21:43
🔗
|
johansch |
s/congratulated/congratulate |
21:43
🔗
|
xmc |
i'm not going to argue this inane topic with you endlessly, so as to leave this channel available for discussing the thing that you so clearly want to talk about |
21:44
🔗
|
xmc |
you're in the right place! continue! |
21:44
🔗
|
|
Frogging sets mode: +oo HCross2 joepie91 |
21:44
🔗
|
Igloo^_^ |
True Frogging, You'd need to have internal stats to get the information. Best have a method to have people request somehow (like how we're doing vine) |
21:45
🔗
|
|
Igloo^_^ is now known as Igloo |
21:45
🔗
|
SketchCow |
Like I said. Top 100,000 probably |
21:45
🔗
|
Frogging |
top by what measure? |
21:46
🔗
|
xmc |
why not by all six measures you can think of, at most it'll be less than a million pictures |
21:46
🔗
|
Frogging |
true |
21:46
🔗
|
johansch |
you could crawl reddit.. get the list of the top 10k subs, crawl them continuously to catch imgur references |
21:48
🔗
|
johansch |
anything that gets on the frontpages of those 10k subs would be a pretty good candidate for archival |
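The "catch imgur references" step could look roughly like this. It is only a sketch: the URL shapes matched by the pattern are an assumption about what imgur links look like, `find_imgur_links` is a made-up helper, and fetching the subreddit pages themselves is left out.

```python
import re

# Assumed link shapes: imgur.com, i.imgur.com or m.imgur.com followed by a path.
IMGUR_URL = re.compile(r"https?://(?:i\.|m\.)?imgur\.com/[A-Za-z0-9/._-]+")


def find_imgur_links(text):
    """Return the distinct imgur URLs found in a blob of page text, in order."""
    found = []
    for url in IMGUR_URL.findall(text):
        url = url.rstrip(".")          # drop trailing sentence punctuation
        if url not in found:
            found.append(url)
    return found


sample = (
    "look http://i.imgur.com/abc123.jpg and https://imgur.com/a/XyZ12. "
    "also http://example.com/x and http://i.imgur.com/abc123.jpg"
)
print(find_imgur_links(sample))
```

The resulting list would then be the input to whatever actually downloads the images.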
21:48
🔗
|
|
bwn has joined #archiveteam-bs |
21:53
🔗
|
godane |
looks like i can get video from rthk.hk going back to at least 2012-09 |
22:02
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
22:32
🔗
|
squires |
johansch: the best way to process reddit data is through bigquery. it will be insanely more efficient to write your filter in bigquery than to crawl reddit pages. |
22:32
🔗
|
johansch |
how do you figure? you mean all of the reddit data is already in BQ? |
22:33
🔗
|
|
bwn has quit IRC (Ping timeout: 244 seconds) |
22:34
🔗
|
squires |
yes |
22:49
🔗
|
|
bwn has joined #archiveteam-bs |
22:51
🔗
|
i336 |
johansch: hey there. I'm also very new here. I was similarly squashed when I tried to post to #archiveteam. that channel is basically for announcements only that everyone needs to read, so general chat makes people go "oh something important" and then when it's not they get mad |
22:51
🔗
|
godane |
so i'm at 1019k items now |
22:51
🔗
|
i336 |
johansch: chat in here is fine though. about imgur, that is a _lot_ of data. like many GB of data. the thing is, most of it is junk. |
22:52
🔗
|
i336 |
godane: I'm running slackware too, November 16, 1881 works fine for me, as does November 10, 1790 |
22:52
🔗
|
godane |
weird |
22:53
🔗
|
johansch |
hi i336.. i found someone who seems quite initiated who was happy to talk to me in private... :) |
22:53
🔗
|
i336 |
johansch: oh that's great to hear ^^ |
22:55
🔗
|
i336 |
godane: the only thing I can think is that our /etc/localtime is different. mine's EST |
22:56
🔗
|
johansch |
I would still like to re-state a fact: the way you guys are naming this pair of channels is like you're setting up newbs to run away, screaming. |
22:57
🔗
|
xmc |
thank you for your input |
22:57
🔗
|
i336 |
it is a bit unintuitive. no disrespect but I did feel like I got pelted with a bucket of water yesterday - but only a bit, I did get it in the end. |
22:57
🔗
|
xmc |
unfortunately it is impossible to rename channels once they're established |
22:57
🔗
|
johansch |
well, no, it's not |
22:57
🔗
|
* |
i336 mumbles something inaudible about channel forwards |
22:57
🔗
|
xmc |
efnet does not provide any facilities in that regard |
22:58
🔗
|
i336 |
o.o |
22:58
🔗
|
xmc |
efnet does not provide any services at all |
22:58
🔗
|
xmc |
we're camping in the mountains |
22:58
🔗
|
i336 |
wow. I see. maybe a topic change could be in order |
22:58
🔗
|
xmc |
have to bring your own water, have to pack out all your cookie boxes |
22:58
🔗
|
i336 |
haha |
23:00
🔗
|
* |
i336 s/lengthy\/off-topic in #archiveteam-bs/this channel is for alerts only - use #archiveteam-bs for ALL discussions/ |
23:01
🔗
|
* |
i336 queues an edit to the wiki that says "if you're new, you probably want to join #archiveteam-bs" |
23:01
🔗
|
* |
i336 nudges the s// and queue in xmc's direction |
23:01
🔗
|
xmc |
why don't you edit the wiki? |
23:01
🔗
|
i336 |
okay! |
23:01
🔗
|
xmc |
it's a -*- wiki -*- |
23:01
🔗
|
i336 |
I didn't know I had edit permission. caveat emptor |
23:01
🔗
|
xmc |
everyone does, once they join |
23:02
🔗
|
i336 |
oh. |
23:02
🔗
|
* |
i336 facepalm |
23:03
🔗
|
i336 |
xmc: I can't edit the front page. that's where I'd want to add this. |
23:04
🔗
|
xmc |
turns out i can't either |
23:04
🔗
|
* |
i336 is unsure what to say at this point |
23:04
🔗
|
xmc |
SketchCow: beep |
23:06
🔗
|
|
i336_ has joined #archiveteam-bs |
23:06
🔗
|
i336_ |
that's better |
23:06
🔗
|
i336_ |
local irssi ftw |
23:06
🔗
|
xmc |
http://archiveteam.org/index.php?title=IRC#Special_ArchiveTeam_IRC_rules |
23:06
🔗
|
xmc |
i should point out that everything you've brought up is already listed |
23:07
🔗
|
i336_ |
you're right |
23:08
🔗
|
i336_ |
I was originally going to try and put #archiveteam-bs references to all the spots the channel is referenced on the main page, but as I read this, I switched tactics and decided it would be a better idea to just put links to this page near all the IRC references instead |
23:08
🔗
|
SketchCow |
Whut |
23:09
🔗
|
SketchCow |
A little hint |
23:10
🔗
|
SketchCow |
If you walk into a 7 year old channel going "u dun it rong" |
23:10
🔗
|
SketchCow |
You might not have all the facts. |
23:11
🔗
|
i336_ |
SketchCow: I completely agree. I figured this was an established place. I wanted to edit the wiki to make the rules more visible/accessible to newcomers so they can get up to speed much more quickly on the established way things are done. I have zero problem with them. |
23:11
🔗
|
xmc |
go ahead and edit the wiki to your satisfaction |
23:11
🔗
|
* |
i336_ reiterates the small tidbit about being unable to edit the main page |
23:12
🔗
|
i336_ |
maybe I can create/use a sandbox page, and someone can copy it over if they like it |
23:12
🔗
|
xmc |
other than that page, go ahead and edit the wiki to your satisfaction |
23:12
🔗
|
i336_ |
lol. okay |
23:15
🔗
|
i336_ |
(it isn't finished yet) |
23:17
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
23:19
🔗
|
|
Aranje has joined #archiveteam-bs |
23:21
🔗
|
SketchCow |
Edit it, put in your suggestions, and I'll approve them over. |
23:27
🔗
|
SketchCow |
I'm having feelings |
23:27
🔗
|
SketchCow |
Do I need to break some overzealous dreams here? |
23:28
🔗
|
zino |
You know you want to. |
23:28
🔗
|
i336_ |
okay, about to tackle the sandbox page. my two edits to http://archiveteam.org/index.php?title=IRC are the bold bit at the top and the first bullet point in the special IRC rules section. |
23:28
🔗
|
i336_ |
(thoughts/suggestions welcome) |
23:29
🔗
|
SketchCow |
This is all going to end sadly, I can see that. |
23:30
🔗
|
SketchCow |
So, hi, I'm Jason. |
23:30
🔗
|
SketchCow |
Somewhere, down in the bedrock of Archiveteam, is me. |
23:31
🔗
|
SketchCow |
You gotta really, really, really work hard these days to dig that far down. |
23:31
🔗
|
SketchCow |
These are good folks, they get amazing work done. |
23:31
🔗
|
SketchCow |
So 99.9999% of the time I'm not even needed in the channel. Magic happens. |
23:31
🔗
|
SketchCow |
You have successfully dug down. |
23:31
🔗
|
SketchCow |
Now you have me. Hi. |
23:31
🔗
|
* |
xmc waves quietly |
23:31
🔗
|
* |
i336_ waves quietly too |
23:32
🔗
|
SketchCow |
Now, you have multiple projects you've dreamed up. |
23:32
🔗
|
SketchCow |
One is to save ex.ua. |
23:32
🔗
|
xmc |
i've been around since day 1 also, but i'm different |
23:32
🔗
|
SketchCow |
One is to fuck with rover.info to get to ex.ua. |
23:33
🔗
|
i336_ |
SketchCow: rover.info has mostly the same info on it, but better |
23:33
🔗
|
SketchCow |
Somewhere down here, you have now encountered several roadblocks, enough that multiple people are messaging me. |
23:33
🔗
|
i336_ |
o.o |
23:33
🔗
|
i336_ |
okay, I don't want to do that. that's a bit of a freakout |
23:33
🔗
|
SketchCow |
It's pretty hard to get multiple people to message me, unless you are all buying me a cake. |
23:33
🔗
|
i336_ |
okay... whatever line I've stepped on I'd like to say I'm sorry upfront |
23:33
🔗
|
i336_ |
so, sorry |
23:34
🔗
|
SketchCow |
Realize there's no single way you can get past me. |
23:34
🔗
|
SketchCow |
Let's start with that. |
23:34
🔗
|
i336_ |
okay. not trying to do that, if I seem to be trying to do that then I've made some mistakes somewhere |
23:34
🔗
|
johansch |
this place does seem very talented at discouraging newbies. |
23:35
🔗
|
i336_ |
johansch: shh. let me figure out what's happened first. for what it's worth I have possibly been trying too hard. |
23:35
🔗
|
|
SketchCow sets mode: +b *!*webchat@*.02-2-6c6b701.cust.bredbandsbolaget.se |
23:35
🔗
|
|
johansch was kicked by SketchCow (johansch) |
23:35
🔗
|
SketchCow |
Yes, it is. |
23:35
🔗
|
i336_ |
yikes |
23:35
🔗
|
Frogging |
lol |
23:35
🔗
|
SketchCow |
Let me tell you what to not get hung up on. |
23:36
🔗
|
SketchCow |
- Do not get hung up on someone losing a massive collection of hollywood films |
23:36
🔗
|
SketchCow |
- Do not get hung up on a crappy .ua version of what_cd |
23:36
🔗
|
SketchCow |
- Do get hung up on unique Ukrainian culture |
23:36
🔗
|
SketchCow |
- Do get hung up on unique support materials for same |
23:37
🔗
|
SketchCow |
If you are capable of finding the last two, then Archive Team can help |
23:37
🔗
|
SketchCow |
And the Internet Archive can probably take it. |
23:37
🔗
|
SketchCow |
If not, no. |
23:37
🔗
|
|
Stiletto has quit IRC (Ping timeout: 246 seconds) |
23:38
🔗
|
SketchCow |
But don't dream up ridiculous rube goldberg approaches using a combination of darknets and dvrs and dragging us down into DC+++ and god knows what else. |
23:38
🔗
|
i336_ |
oh lol |
23:38
🔗
|
SketchCow |
Do you know someone who can be mailed a hard drive, who can do the work. |
23:38
🔗
|
i336_ |
alright. |
23:38
🔗
|
i336_ |
no one with fast internet, unfortunately. :( |
23:38
🔗
|
SketchCow |
Do you know some way to grab unique material without trying to flood all our channels |
23:39
🔗
|
SketchCow |
Because we're about to ctrl+c and ctrl+v the government over here |
23:39
🔗
|
SketchCow |
I've hit refresh 3 times and I've still not had any mail to archiveteam from a furious johansch |
23:39
🔗
|
SketchCow |
I am disappoint. |
23:40
🔗
|
i336_ |
OK. let me try and explain a bit |
23:40
🔗
|
SketchCow |
We have interest and are willing to support trying to grab some useful parts of ex.ua |
23:40
🔗
|
SketchCow |
There's not much left to explain, but go ahead. |
23:40
🔗
|
SketchCow |
In here. |
23:40
🔗
|
SketchCow |
And not through /msg mania |
23:41
🔗
|
i336_ |
fussy but relevant background: I've had some long-term issues with storage and diskspace for about the last 10 years. TL;DR being given old people's computers and no central file server w/ a big disk = snowball of duplicates. trying to solve it, financial issues. so saving stuff is a big deal for me. #2. I saw all the people going "noooo D:" about what.cd, and I guess I fixated on that not happening |
23:41
🔗
|
i336_ |
again. #3. I don't come up with good ideas when things are going 1,000 miles an hour like this (20 days to save a gigantic website). I have known bugs seeing the bigger picture. |
23:42
🔗
|
xmc |
trust us, the what.cd data is safe (if inaccessible) |
23:43
🔗
|
i336_ |
so. unhelpful biases combined with a brain that focuses on detail too much and takes time to come up with ideas that are actually good = I'll acknowledge I've messed around with and mildly annoyed everyone here to some extent |
23:43
🔗
|
i336_ |
xmc: that is awesome to hear. is that 99% of it, or a 100% snapshot, if I may ask? I have nothing I can do with the info except smile, if it's 100% :) |
23:43
🔗
|
SketchCow |
What exactly is the difference between 99% and 100% |
23:44
🔗
|
i336_ |
"we had arrangements in place but when the plug got pulled our $mirroring_system didn't get the last bit" |
23:44
🔗
|
i336_ |
pretty much any applicable direct interpretation of that |
23:45
🔗
|
SketchCow |
Yes, I'm just philosophically asking why that's a measurement of happiness. |
23:45
🔗
|
SketchCow |
Getting hung up on "we got most of it" vs. "we got all of it" is how you end up drinking too much and drowning in a bathtub |
23:45
🔗
|
SketchCow |
Ever hear of fdupes? Use fdupes |
23:46
🔗
|
i336_ |
fdupes would be awesome if I didn't have 10 HDDs I can't all have plugged in at once... some of which are clicking and need backing up to other disks before I can even use them |
23:48
🔗
|
i336_ |
I do hear you, I've thought of a lot of ideas, most of them boil down to "agh, TB HDDs are hundreds of dollars, and I have all this other medical stuff that's chewing up my funds first." (annoying and boring long story) |
23:49
🔗
|
xmc |
as a first approximation, https://archive.org/upload can help take the edge off |
23:50
🔗
|
* |
i336_ shakes fist at 50GB/mo ISP bandwidth cap (which is pretty much at capacity as the end of the month rolls around) |
23:50
🔗
|
i336_ |
besides that, I'm on ADSL2+. 80KB/s upload. |
23:51
🔗
|
godane |
i336_: you may need a local wikipedia |
23:51
🔗
|
xmc |
you can mail me sd cards or usb drives or whatever and i'll upload it, if you include metadata |
23:51
🔗
|
godane |
http://download.kiwix.org/zim/wikipedia/?C=M;O=D |
23:51
🔗
|
i336_ |
godane: I would actually consider that, but I don't browse the site frequently enough |
23:51
🔗
|
i336_ |
xmc: huh, nice. I'll keep that in mind :) |
23:51
🔗
|
godane |
i sort of figure that |
23:52
🔗
|
godane |
but its something |
23:52
🔗
|
xmc |
i336_: and i'll mail them back, because it sounds like money is an issue for you |
23:52
🔗
|
godane |
also grab No-Intro collection for tons of retro gaming |
23:52
🔗
|
i336_ |
godane: oh, okay :> |
23:53
🔗
|
godane |
this is also the RACHEL project from world possible : http://dev.worldpossible.org/cgi/rachelmods.pl |
23:53
🔗
|
godane |
it has wikihow |
23:53
🔗
|
i336_ |
xmc: well, not for the normal/average reasons - I'm just on disability support for mental health issues, and I understand that means I can't earn very much income because of it. but the various medicinal things I require chew up most of the budget. interesting deadlock I've been trying to headscratch how to solve for a while |
23:54
🔗
|
xmc |
mmm, tricky, that |
23:54
🔗
|
SketchCow |
godane would know nothing about that |
23:54
🔗
|
i336_ |
godane: this is interesting. what is this? |
23:54
🔗
|
i336_ |
SketchCow: what do you mean? |
23:55
🔗
|
i336_ |
godane: ...oh, it's a portable internet-in-a-box. cool! |
23:55
🔗
|
SketchCow |
godane, pretty much our most prolific contributor, is on disability |
23:55
🔗
|
SketchCow |
and is also awesome |
23:55
🔗
|
i336_ |
wow |
23:55
🔗
|
SketchCow |
never gets kicked much |
23:55
🔗
|
SketchCow |
Listens |
23:55
🔗
|
SketchCow |
I like that guy |
23:55
🔗
|
godane |
i have a inbox box |
23:56
🔗
|
i336_ |
I have problems with listening and understanding too; I often don't reach the "aha" moment until some measure of irritation and "what is this guy even..." has happened a bit. Known bug, trying to fix. |
23:56
🔗
|
xmc |
interim measure, keep your mouth closed a bit more often than is comfortable |
23:56
🔗
|
xmc |
it actually works really well |
23:56
🔗
|
i336_ |
I'll try that. interesting way of putting it. |
23:57
🔗
|
i336_ |
thanks |
23:57
🔗
|
xmc |
but yeah. when you think "this is confusing" just stay confused and read some more, then if you're *still* confused after a kinda uncomfortably long time, do what you would have done |
23:57
🔗
|
xmc |
but you have the advantage of having done your homework first |