Time |
Nickname |
Message |
00:05
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
00:08
🔗
|
|
twoTBHetz has joined #archiveteam-bs |
00:08
🔗
|
|
Sk1d has joined #archiveteam-bs |
00:13
🔗
|
|
m007a83 has joined #archiveteam-bs |
00:21
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
00:23
🔗
|
|
Sk1d has joined #archiveteam-bs |
00:24
🔗
|
jodizzle |
betamax: Are there any other URLs that need snscraping at the moment? Or candidate websites that need archiving? |
00:24
🔗
|
jodizzle |
On a related note, I was looking around for other sites that might have candidate links, and found this: https://vote-usa.org/forresearch.aspx |
00:24
🔗
|
jodizzle |
Seems comprehensive and well-structured, will probably try writing a scraper for it later. |
00:45
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
00:51
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 265 seconds) |
00:52
🔗
|
|
Mateon1 has joined #archiveteam-bs |
00:58
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:01
🔗
|
|
Sk1d has joined #archiveteam-bs |
01:15
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:19
🔗
|
|
Sk1d has joined #archiveteam-bs |
01:31
🔗
|
|
twoTBHetz has quit IRC (Ping timeout: 260 seconds) |
01:35
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:39
🔗
|
|
Sk1d has joined #archiveteam-bs |
01:56
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:59
🔗
|
|
Sk1d has joined #archiveteam-bs |
02:05
🔗
|
|
Stilett0 has joined #archiveteam-bs |
02:10
🔗
|
|
Stiletto has quit IRC (Ping timeout: 633 seconds) |
02:15
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
02:20
🔗
|
|
Sk1d has joined #archiveteam-bs |
02:32
🔗
|
|
Ryz has quit IRC (west.us.hub irc.Prison.NET) |
02:32
🔗
|
|
achip has quit IRC (west.us.hub irc.Prison.NET) |
02:33
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
02:37
🔗
|
|
Sk1d has joined #archiveteam-bs |
02:50
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
02:54
🔗
|
|
Sk1d has joined #archiveteam-bs |
03:05
🔗
|
|
achip has joined #archiveteam-bs |
03:07
🔗
|
|
Ryz has joined #archiveteam-bs |
03:32
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
03:36
🔗
|
|
Sk1d has joined #archiveteam-bs |
03:40
🔗
|
|
bakJAA_ has joined #archiveteam-bs |
03:40
🔗
|
|
swebb sets mode: +o bakJAA_ |
03:40
🔗
|
|
JAA sets mode: +o bakJAA_ |
03:40
🔗
|
|
bakJAA has quit IRC (Read error: Connection reset by peer) |
03:41
🔗
|
|
kyounko has quit IRC (Ping timeout: 492 seconds) |
03:43
🔗
|
|
mgrytbak_ has quit IRC (Ping timeout: 492 seconds) |
03:47
🔗
|
|
mgrytbak_ has joined #archiveteam-bs |
03:50
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
03:54
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:00
🔗
|
ggus |
hi, is it possible to convert .warc to static html? |
04:06
🔗
|
jodizzle |
ggus: If the WARC is a WARC of an HTML file, sure. You just need to extract the contents. |
04:06
🔗
|
jodizzle |
I think there are a bunch of tools, but take a look at something like https://github.com/chfoo/warcat |
04:07
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
04:12
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:19
🔗
|
ggus |
jodizzle: thanks! i'll take a look |
04:28
🔗
|
|
Martle__ has quit IRC (Ping timeout: 252 seconds) |
04:43
🔗
|
|
qw3rty114 has joined #archiveteam-bs |
04:50
🔗
|
|
qw3rty113 has quit IRC (Read error: Operation timed out) |
04:50
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
04:54
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:57
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
05:06
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
05:10
🔗
|
|
Sk1d has joined #archiveteam-bs |
05:11
🔗
|
|
odemg has joined #archiveteam-bs |
05:23
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
05:27
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:01
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
06:05
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:17
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
06:22
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:36
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
06:41
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:53
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
06:58
🔗
|
|
Sk1d has joined #archiveteam-bs |
07:43
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
07:48
🔗
|
|
Sk1d has joined #archiveteam-bs |
08:00
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:04
🔗
|
|
Sk1d has joined #archiveteam-bs |
08:12
🔗
|
|
adinbied has quit IRC (Remote host closed the connection) |
08:12
🔗
|
|
adinbied has joined #archiveteam-bs |
08:15
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:18
🔗
|
|
adinbied_ has joined #archiveteam-bs |
08:19
🔗
|
|
adinbied has quit IRC (Ping timeout: 252 seconds) |
08:20
🔗
|
|
Sk1d has joined #archiveteam-bs |
08:32
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:36
🔗
|
|
Sk1d has joined #archiveteam-bs |
08:51
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:56
🔗
|
|
Sk1d has joined #archiveteam-bs |
09:12
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
09:15
🔗
|
|
Sk1d has joined #archiveteam-bs |
09:18
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
09:28
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
09:30
🔗
|
Ryz |
Awww yeah, Flashpoint - archiving Adobe Flash games and animations and making it work: https://old.reddit.com/r/emulation/comments/9jt2yc/flashpoint_50_brand_new_launcher_two_new_plugins/ |
09:30
🔗
|
Ryz |
And a more recent topic: https://old.reddit.com/r/emulation/comments/9s0lz6/flashpoint_51_the_great_filter_playlists_a_new/ |
09:32
🔗
|
|
Sk1d has joined #archiveteam-bs |
09:37
🔗
|
Flashfire |
Bluemax you have a fan |
09:39
🔗
|
Ryz |
Oh, I've known about this for 2-4 months now, and even though I can't play Adobe Flash games right now, I casually keep getting informed about it |
09:41
🔗
|
Ryz |
I may plan to help out on archiving Adobe Flash games too or doing anything related to the project |
09:45
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
09:48
🔗
|
|
Sk1d has joined #archiveteam-bs |
10:02
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
10:07
🔗
|
|
Sk1d has joined #archiveteam-bs |
10:08
🔗
|
|
godane has quit IRC (Ping timeout: 265 seconds) |
10:10
🔗
|
Flashfire |
Bluemax hangs out around here somewhere |
10:18
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
10:24
🔗
|
|
Sk1d has joined #archiveteam-bs |
10:54
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
10:58
🔗
|
betamax |
jodizzle: your approach of looking for sites that might have candidate links is a good one |
10:59
🔗
|
|
Sk1d has joined #archiveteam-bs |
11:00
🔗
|
betamax |
basically, it will get to a point where the databases of candidate info have been exhausted and then the effort required to discover more urls becomes too great |
11:00
🔗
|
betamax |
eg: googling candidate names and looking for their websites is a bit infeasible - at least in my opinion! |
11:07
🔗
|
Flashfire |
Betamax give me a small list of candidates. Combining it with my FTP search I may come up with a few useful things to throw into archivebot |
11:11
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
11:14
🔗
|
|
Sk1d has joined #archiveteam-bs |
11:27
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
11:31
🔗
|
|
Sk1d has joined #archiveteam-bs |
11:44
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
11:49
🔗
|
|
Sk1d has joined #archiveteam-bs |
12:08
🔗
|
|
Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805]) |
12:53
🔗
|
|
t2t2 has quit IRC (Read error: Operation timed out) |
12:56
🔗
|
|
t2t2 has joined #archiveteam-bs |
12:59
🔗
|
|
godane has joined #archiveteam-bs |
13:08
🔗
|
|
adinbied_ is now known as adinbied |
13:12
🔗
|
adinbied |
Free Music Archive is returning a 503 for me, did it go down a day early? |
13:13
🔗
|
adinbied |
NVM, working again - was just a temp issue |
13:34
🔗
|
kiska |
Probably because AB is slamming it, I think |
13:39
🔗
|
anarcat |
still far from done too :/ |
13:42
🔗
|
kiska |
Currently there is: "622,025 in queue" |
13:44
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
13:44
🔗
|
|
LFlare has joined #archiveteam-bs |
13:44
🔗
|
|
Mateon1 has joined #archiveteam-bs |
13:47
🔗
|
adinbied |
Looking at the AB log is showing a bunch of 503s at the moment |
13:51
🔗
|
adinbied |
With an 18 sec delay it seems to be doing better |
14:04
🔗
|
kiska |
It was 500-700ms that was doing better |
14:21
🔗
|
betamax |
Flashfire: when I did the discovery I just grabbed urls - so I don't actually have a list of candidate names |
14:31
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
14:34
🔗
|
|
Sk1d has joined #archiveteam-bs |
14:48
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
14:49
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
14:50
🔗
|
|
Sk1d has joined #archiveteam-bs |
15:06
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
15:11
🔗
|
|
Sk1d has joined #archiveteam-bs |
15:16
🔗
|
|
SketchCo1 is now known as SketchCow |
15:16
🔗
|
|
LFlare has quit IRC (Read error: Operation timed out) |
15:23
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
15:26
🔗
|
|
LFlare has joined #archiveteam-bs |
15:28
🔗
|
|
Sk1d has joined #archiveteam-bs |
15:43
🔗
|
|
Martle has joined #archiveteam-bs |
16:19
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
16:23
🔗
|
|
Sk1d has joined #archiveteam-bs |
16:24
🔗
|
|
Somebody2 has quit IRC (Read error: Operation timed out) |
16:25
🔗
|
|
DFJustin has quit IRC (Read error: Connection reset by peer) |
16:26
🔗
|
|
DFJustin has joined #archiveteam-bs |
16:26
🔗
|
|
swebb sets mode: +o DFJustin |
16:27
🔗
|
|
_Verified has joined #archiveteam-bs |
16:29
🔗
|
|
_Verified has quit IRC (Client Quit) |
16:30
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
16:34
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
16:38
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
16:41
🔗
|
|
Sk1d has joined #archiveteam-bs |
16:57
🔗
|
|
schbirid has joined #archiveteam-bs |
17:07
🔗
|
|
LFlare has quit IRC (Ping timeout: 252 seconds) |
17:12
🔗
|
|
wp494 has quit IRC (Ping timeout: 260 seconds) |
17:12
🔗
|
|
wp494 has joined #archiveteam-bs |
17:14
🔗
|
|
Somebody2 has joined #archiveteam-bs |
17:29
🔗
|
|
LFlare has joined #archiveteam-bs |
17:47
🔗
|
astrid |
re gitorious |
17:48
🔗
|
astrid |
i mean. the storage layer is real infrastructure. |
17:48
🔗
|
astrid |
but the way we punched a hole thru the layers in the middle to present it to a virtual machine, is, a bit unusual |
17:57
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
18:02
🔗
|
|
Sk1d has joined #archiveteam-bs |
18:06
🔗
|
|
Ryz has joined #archiveteam-bs |
18:07
🔗
|
|
Martle_ has joined #archiveteam-bs |
18:08
🔗
|
|
Martle_ has quit IRC (Client Quit) |
18:09
🔗
|
|
Martle has quit IRC (Ping timeout: 252 seconds) |
19:03
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
19:06
🔗
|
|
Sk1d has joined #archiveteam-bs |
19:22
🔗
|
|
SimpBrain has quit IRC (Read error: Operation timed out) |
19:38
🔗
|
|
m007a83_ has joined #archiveteam-bs |
19:41
🔗
|
|
m007a83 has quit IRC (Read error: Operation timed out) |
19:43
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
19:46
🔗
|
|
Sk1d has joined #archiveteam-bs |
19:48
🔗
|
|
m007a83_ is now known as m007a83 |
19:59
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
20:03
🔗
|
|
Sk1d has joined #archiveteam-bs |
20:16
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
20:20
🔗
|
|
Sk1d has joined #archiveteam-bs |
20:26
🔗
|
|
icedice has joined #archiveteam-bs |
20:32
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
20:34
🔗
|
|
Sk1d has joined #archiveteam-bs |
20:42
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
20:48
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
20:52
🔗
|
|
Sk1d has joined #archiveteam-bs |
20:53
🔗
|
|
BlueMax has joined #archiveteam-bs |
21:16
🔗
|
betamax |
right, I just found a flaw in the regex I was using to extract URLs from the campaign sites I downloaded: |
21:16
🔗
|
betamax |
it doesn't deal with 'src="' which is what YouTube etc. use for embedded content |
21:17
🔗
|
betamax |
whatever I do will undoubtedly be wrong, does anyone have a good grep command for extracting URLs from HTML? |
21:17
🔗
|
betamax |
don't need to be inline URLs, just ones that are links or embedded content |
21:18
🔗
|
betamax |
preferably one that uses grep rather than egrep, for speed |
21:22
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
21:22
🔗
|
schbirid |
i like doing poop like: grep -Po 'src=".*?"' index.html |
21:23
🔗
|
schbirid |
or sed 's#src="#\nsrc=#g' index.html | grep src= |
21:23
🔗
|
schbirid |
:D |
21:23
🔗
|
schbirid |
first one aint poop actually, it's the best i know |
21:25
🔗
|
|
BlueMax has quit IRC (Ping timeout: 260 seconds) |
21:25
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
21:26
🔗
|
betamax |
coming from a regex idiot, is there any way to alter the first one so it doesn't have the 'src="' and '"' in it? |
21:27
🔗
|
jodizzle |
betamax: At that point I would just look around for a command line utility that does processing similar to what jq does for json |
21:28
🔗
|
jodizzle |
or look up some sed or awk command |
21:28
🔗
|
jodizzle |
If it's slow, you can always try and use parallel to speed things up |
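A rough sketch of the parallel idea, using `xargs -P` in place of GNU parallel (a substitution made for this example, not something suggested in the channel). The temporary directory, the file names, and the simplified `src="[^"]*"` pattern are made up for illustration; it assumes GNU grep with `-P` support and an `xargs` that understands `-0` and `-P`.

```shell
# Hypothetical sample data: two tiny HTML files in a throwaway directory.
dir="$(mktemp -d)"
printf '<img src="a.png">\n' > "$dir/one.html"
printf '<script src="b.js"></script>\n' > "$dir/two.html"

# -print0 / -0 keep unusual file names intact.
# -n 1 hands one file to each grep; -P 2 runs up to two greps at once.
# -h drops the file name prefix so only the matches are printed.
found="$(find "$dir" -name '*.html' -print0 \
  | xargs -0 -n 1 -P 2 grep -Poh 'src="[^"]*"' \
  | LC_ALL=C sort)"

printf '%s\n' "$found"
rm -rf "$dir"
```

The final sort is only there because the order in which parallel jobs finish is not guaranteed.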
21:30
🔗
|
betamax |
grep is low-memory, though (only have ~2GB to work with) |
21:30
🔗
|
|
Sk1d has joined #archiveteam-bs |
21:30
🔗
|
schbirid |
betamax: grep -Po 'src="\K.*?"' index.html |
21:30
🔗
|
schbirid |
via https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match |
21:31
🔗
|
schbirid |
oops, trailing " |
21:31
🔗
|
schbirid |
grep -Po 'src="\K.*?(?=")' index.html |
21:31
🔗
|
schbirid |
wicked cool |
21:31
🔗
|
schbirid |
i always just used a sed pipe for that before. thanks |
21:32
🔗
|
betamax |
that's it! thanks |
21:33
🔗
|
schbirid |
np! |
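For reference, the difference between schbirid's first command and the final one can be seen on a small sample. This is only a sketch: the sample file and its URLs are invented, and it assumes GNU grep built with PCRE support (`-P`).

```shell
# Hypothetical sample file; the URLs are placeholders.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<img src="https://example.org/a.png" alt="a">
<iframe src="https://www.youtube.com/embed/VIDEO_ID"></iframe>
EOF

# First version: the match includes the attribute name and both quotes.
with_attr="$(grep -Po 'src=".*?"' "$sample")"

# Final version: \K throws away what was matched so far (src="),
# and (?=") requires a closing quote without including it in the match.
urls_only="$(grep -Po 'src="\K.*?(?=")' "$sample")"

printf '%s\n' "$with_attr" "$urls_only"
rm -f "$sample"
```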
21:38
🔗
|
betamax |
wait, that one doesn't work for single-quotes |
21:38
🔗
|
betamax |
is that an issue do you think? |
21:41
🔗
|
JAA |
grep -Po 'src=(["'"'"'])\K.*?(?=\1)' index.html |
21:57
🔗
|
|
w0rmybak has quit IRC (Quit: Ping timeout (120 seconds)) |
21:57
🔗
|
|
kiskabak has quit IRC (Quit: Ping timeout (120 seconds)) |
21:57
🔗
|
|
Flashback has quit IRC (Quit: Ping timeout (120 seconds)) |
21:58
🔗
|
|
w0rmybak has joined #archiveteam-bs |
22:00
🔗
|
|
kiskabak has joined #archiveteam-bs |
22:00
🔗
|
|
w0rmybak has quit IRC (Client Quit) |
22:00
🔗
|
jodizzle |
JAA: How does that one work? Are the quotes escaping each other? |
22:00
🔗
|
|
w0rmybak has joined #archiveteam-bs |
22:01
🔗
|
JAA |
jodizzle: It's actually three separate strings: 'src=(["' + "'" + '])...' |
22:01
🔗
|
JAA |
That's because you can't have a single quote inside a single-quoted string. There is no backslash escaping or similar. |
22:02
🔗
|
JAA |
This is on the shell level, i.e. the argument actually sent to grep is just src=(["'])\K... |
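JAA's explanation can be checked by storing the argument in a variable and printing it. A minimal sketch, assuming a POSIX-style shell and GNU grep with `-P`; the variable names and sample file are made up for the example.

```shell
# Three adjacent quoted pieces: 'src=(["'  +  "'"  +  '])\K.*?(?=\1)'
# The shell joins them into a single word before grep ever sees it.
pattern='src=(["'"'"'])\K.*?(?=\1)'
printf '%s\n' "$pattern"   # prints: src=(["'])\K.*?(?=\1)

# Hypothetical sample with one double-quoted and one single-quoted attribute.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<img src="double.png">
<img src='single.png'>
EOF

# \1 refers back to whichever quote character opened the value.
matches="$(grep -Po "$pattern" "$sample")"
printf '%s\n' "$matches"
rm -f "$sample"
```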
22:03
🔗
|
|
Flashback has joined #archiveteam-bs |
22:06
🔗
|
jodizzle |
JAA: Ahh took me a second but I get it now. |
22:06
🔗
|
jodizzle |
That's pretty cool, thanks. |
22:06
🔗
|
JAA |
:-) |
22:10
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
22:11
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
22:11
🔗
|
|
bakJAA_ is now known as bakJAA |
22:15
🔗
|
|
Sk1d has joined #archiveteam-bs |
22:30
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
22:33
🔗
|
|
Sk1d has joined #archiveteam-bs |
22:40
🔗
|
betamax |
JAA: regex n00b again. Any way to add that regex to one that works with 'href' links as well? |
22:41
🔗
|
betamax |
combining it with the one I was using before won't work: in my cluelessness it consisted of about 5 grep commands all chained together |
22:45
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
22:48
🔗
|
jodizzle |
betamax: Don't quote me, but I think if you did this it would work: grep -Po '(href|src)=(["'"'"'])\K.*?(?=\2)' |
22:49
🔗
|
betamax |
thanks, will try |
22:51
🔗
|
|
Sk1d has joined #archiveteam-bs |
22:51
🔗
|
jodizzle |
Seems to work on a random test html file for me. |
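A quick way to repeat that kind of check, as a sketch only: the sample HTML is invented, and it assumes GNU grep with `-P`. The group numbering shifts because `(href|src)` is now group 1, so the quote character becomes `\2`. Unquoted attribute values are not matched by this pattern.

```shell
# Hypothetical test file mixing href/src and both quote styles.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<a href="https://example.org/about">About</a>
<link href='/style.css' rel="stylesheet">
<iframe src="https://example.org/embed/1"></iframe>
<a href=/unquoted>skipped</a>
EOF

# Group 1 is the attribute name, group 2 the opening quote.
# LC_ALL=C makes the sort order independent of the system locale.
links="$(grep -Po '(href|src)=(["'"'"'])\K.*?(?=\2)' "$sample" | LC_ALL=C sort -u)"

printf '%s\n' "$links"
rm -f "$sample"
```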
22:55
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
22:56
🔗
|
|
dashcloud has joined #archiveteam-bs |
22:56
🔗
|
|
Mateon1 has joined #archiveteam-bs |
23:18
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
23:19
🔗
|
|
adinbied has quit IRC (Left Channel.) |
23:39
🔗
|
|
adinbied has joined #archiveteam-bs |