| Time |
Nickname |
Message |
|
00:05
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
00:08
🔗
|
|
twoTBHetz has joined #archiveteam-bs |
|
00:08
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
00:13
🔗
|
|
m007a83 has joined #archiveteam-bs |
|
00:21
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
00:23
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
00:24
🔗
|
jodizzle |
betamax: Are there any other URLs that need snscraping at the moment? Or candidate websites that need archiving? |
|
00:24
🔗
|
jodizzle |
On a related note, I was looking around for other sites that might have candidate links, and found this: https://vote-usa.org/forresearch.aspx |
|
00:24
🔗
|
jodizzle |
Seems comprehensive and well-structured, will probably try writing a scraper for it later. |
|
00:45
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
|
00:51
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 265 seconds) |
|
00:52
🔗
|
|
Mateon1 has joined #archiveteam-bs |
|
00:58
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
01:01
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
01:15
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
01:19
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
01:31
🔗
|
|
twoTBHetz has quit IRC (Ping timeout: 260 seconds) |
|
01:35
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
01:39
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
01:56
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
01:59
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
02:05
🔗
|
|
Stilett0 has joined #archiveteam-bs |
|
02:10
🔗
|
|
Stiletto has quit IRC (Ping timeout: 633 seconds) |
|
02:15
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
02:20
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
02:32
🔗
|
|
Ryz has quit IRC (west.us.hub irc.Prison.NET) |
|
02:32
🔗
|
|
achip has quit IRC (west.us.hub irc.Prison.NET) |
|
02:33
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
02:37
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
02:50
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
02:54
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
03:05
🔗
|
|
achip has joined #archiveteam-bs |
|
03:07
🔗
|
|
Ryz has joined #archiveteam-bs |
|
03:32
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
03:36
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
03:40
🔗
|
|
bakJAA_ has joined #archiveteam-bs |
|
03:40
🔗
|
|
swebb sets mode: +o bakJAA_ |
|
03:40
🔗
|
|
JAA sets mode: +o bakJAA_ |
|
03:40
🔗
|
|
bakJAA has quit IRC (Read error: Connection reset by peer) |
|
03:41
🔗
|
|
kyounko has quit IRC (Ping timeout: 492 seconds) |
|
03:43
🔗
|
|
mgrytbak_ has quit IRC (Ping timeout: 492 seconds) |
|
03:47
🔗
|
|
mgrytbak_ has joined #archiveteam-bs |
|
03:50
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
03:54
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
04:00
🔗
|
ggus |
hi, is it possible to convert .warc to static html? |
|
04:06
🔗
|
jodizzle |
ggus: If the WARC is a WARC of an HTML file, sure. You just need to extract the contents. |
|
04:06
🔗
|
jodizzle |
I think there are a bunch of tools, but take a look at something like https://github.com/chfoo/warcat |
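A minimal sketch of what "extract the contents" can mean in practice, assuming an uncompressed `.warc` file (not `.warc.gz`) whose HTTP responses are not chunk-encoded. The function names `iter_warc_records` and `extract_html` are made up for this illustration and are not part of warcat or any other tool; a real extractor handles many more cases.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        # 'line' is now the version line, e.g. b"WARC/1.0"
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record block is exactly Content-Length bytes long.
        block = stream.read(int(headers.get("content-length", 0)))
        yield headers, block


def extract_html(stream):
    """Return {target URI: body bytes} for response records served as text/html."""
    pages = {}
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # A response block is a raw HTTP message: header section, blank line, body.
        http_head, _, body = block.partition(b"\r\n\r\n")
        if b"text/html" in http_head.lower():
            pages[headers.get("warc-target-uri")] = body
    return pages


# Hypothetical one-record WARC built in memory for demonstration.
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body>hi</body></html>"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Length: " + str(len(http)).encode() + b"\r\n",
    b"\r\n",
    http,
    b"\r\n\r\n",
])
pages = extract_html(io.BytesIO(sample))
```

Each value in `pages` could then be written to a `.html` file. Links inside the page still point at the original site, so a fully browsable static copy needs link rewriting on top of this.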
|
04:07
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
04:12
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
04:19
🔗
|
ggus |
jodizzle: thanks! i'll take a look |
|
04:28
🔗
|
|
Martle__ has quit IRC (Ping timeout: 252 seconds) |
|
04:43
🔗
|
|
qw3rty114 has joined #archiveteam-bs |
|
04:50
🔗
|
|
qw3rty113 has quit IRC (Read error: Operation timed out) |
|
04:50
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
04:54
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
04:57
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
|
05:06
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
05:10
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
05:11
🔗
|
|
odemg has joined #archiveteam-bs |
|
05:23
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
05:27
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
06:01
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
06:05
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
06:17
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
06:22
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
06:36
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
06:41
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
06:53
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
06:58
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
07:43
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
07:48
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
08:00
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
08:04
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
08:12
🔗
|
|
adinbied has quit IRC (Remote host closed the connection) |
|
08:12
🔗
|
|
adinbied has joined #archiveteam-bs |
|
08:15
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
08:18
🔗
|
|
adinbied_ has joined #archiveteam-bs |
|
08:19
🔗
|
|
adinbied has quit IRC (Ping timeout: 252 seconds) |
|
08:20
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
08:32
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
08:36
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
08:51
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
08:56
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
09:12
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
09:15
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
09:18
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
|
09:28
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
09:30
🔗
|
Ryz |
Awww yeah, Flashpoint - archiving Adobe Flash games and animations and making it work: https://old.reddit.com/r/emulation/comments/9jt2yc/flashpoint_50_brand_new_launcher_two_new_plugins/ |
|
09:30
🔗
|
Ryz |
And a more recent topic: https://old.reddit.com/r/emulation/comments/9s0lz6/flashpoint_51_the_great_filter_playlists_a_new/ |
|
09:32
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
09:37
🔗
|
Flashfire |
Bluemax you have a fan |
|
09:39
🔗
|
Ryz |
Oh, I've known about this for 2-4 months now, and even though I can't play Adobe Flash games right now, I casually keep getting informed about it |
|
09:41
🔗
|
Ryz |
I may plan to help out on archiving Adobe Flash games too or doing anything related to the project |
|
09:45
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
09:48
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
10:02
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
10:07
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
10:08
🔗
|
|
godane has quit IRC (Ping timeout: 265 seconds) |
|
10:10
🔗
|
Flashfire |
Bluemax hangs out around here somewhere |
|
10:18
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
10:24
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
10:54
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
10:58
🔗
|
betamax |
jodizzle: your approach of looking for sites that might have candidate links is a good one |
|
10:59
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
11:00
🔗
|
betamax |
basically, it will get to a point where the databases of candidate info have been exhausted and then the effort required to discover more urls becomes too great |
|
11:00
🔗
|
betamax |
eg: googling candidate names and looking for their websites is a bit infeasible - at least in my opinion! |
|
11:07
🔗
|
Flashfire |
Betamax give me a small list of candidates. Combining it with my FTP search I may come up with a few useful things to throw into archivebot |
|
11:11
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
11:14
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
11:27
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
11:31
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
11:44
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
11:49
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
12:08
🔗
|
|
Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805]) |
|
12:53
🔗
|
|
t2t2 has quit IRC (Read error: Operation timed out) |
|
12:56
🔗
|
|
t2t2 has joined #archiveteam-bs |
|
12:59
🔗
|
|
godane has joined #archiveteam-bs |
|
13:08
🔗
|
|
adinbied_ is now known as adinbied |
|
13:12
🔗
|
adinbied |
Free Music Archive is returning a 503 for me, did it go down a day early? |
|
13:13
🔗
|
adinbied |
NVM, working again - was just a temp issue |
|
13:34
🔗
|
kiska |
Probably because AB is slamming it, I think |
|
13:39
🔗
|
anarcat |
still far from done too :/ |
|
13:42
🔗
|
kiska |
Currently there is: "622,025 in queue" |
|
13:44
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
|
13:44
🔗
|
|
LFlare has joined #archiveteam-bs |
|
13:44
🔗
|
|
Mateon1 has joined #archiveteam-bs |
|
13:47
🔗
|
adinbied |
The AB log is showing a bunch of 503s at the moment |
|
13:51
🔗
|
adinbied |
With an 18 sec delay it seems to be doing better |
|
14:04
🔗
|
kiska |
It was 500-700ms that was doing better |
|
14:21
🔗
|
betamax |
Flashfire: when I did the discovery I just grabbed urls - so I don't actually have a list of candidate names |
|
14:31
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
14:34
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
14:48
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
14:49
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
|
14:50
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
15:06
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
15:11
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
15:16
🔗
|
|
SketchCo1 is now known as SketchCow |
|
15:16
🔗
|
|
LFlare has quit IRC (Read error: Operation timed out) |
|
15:23
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
15:26
🔗
|
|
LFlare has joined #archiveteam-bs |
|
15:28
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
15:43
🔗
|
|
Martle has joined #archiveteam-bs |
|
16:19
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
16:23
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
16:24
🔗
|
|
Somebody2 has quit IRC (Read error: Operation timed out) |
|
16:25
🔗
|
|
DFJustin has quit IRC (Read error: Connection reset by peer) |
|
16:26
🔗
|
|
DFJustin has joined #archiveteam-bs |
|
16:26
🔗
|
|
swebb sets mode: +o DFJustin |
|
16:27
🔗
|
|
_Verified has joined #archiveteam-bs |
|
16:29
🔗
|
|
_Verified has quit IRC (Client Quit) |
|
16:30
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
|
16:34
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
|
16:38
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
16:41
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
16:57
🔗
|
|
schbirid has joined #archiveteam-bs |
|
17:07
🔗
|
|
LFlare has quit IRC (Ping timeout: 252 seconds) |
|
17:12
🔗
|
|
wp494 has quit IRC (Ping timeout: 260 seconds) |
|
17:12
🔗
|
|
wp494 has joined #archiveteam-bs |
|
17:14
🔗
|
|
Somebody2 has joined #archiveteam-bs |
|
17:29
🔗
|
|
LFlare has joined #archiveteam-bs |
|
17:47
🔗
|
astrid |
re gitorious |
|
17:48
🔗
|
astrid |
i mean. the storage layer is real infrastructure. |
|
17:48
🔗
|
astrid |
but the way we punched a hole thru the layers in the middle to present it to a virtual machine, is, a bit unusual |
|
17:57
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
18:02
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
18:06
🔗
|
|
Ryz has joined #archiveteam-bs |
|
18:07
🔗
|
|
Martle_ has joined #archiveteam-bs |
|
18:08
🔗
|
|
Martle_ has quit IRC (Client Quit) |
|
18:09
🔗
|
|
Martle has quit IRC (Ping timeout: 252 seconds) |
|
19:03
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
19:06
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
19:22
🔗
|
|
SimpBrain has quit IRC (Read error: Operation timed out) |
|
19:38
🔗
|
|
m007a83_ has joined #archiveteam-bs |
|
19:41
🔗
|
|
m007a83 has quit IRC (Read error: Operation timed out) |
|
19:43
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
19:46
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
19:48
🔗
|
|
m007a83_ is now known as m007a83 |
|
19:59
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
20:03
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
20:16
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
20:20
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
20:26
🔗
|
|
icedice has joined #archiveteam-bs |
|
20:32
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
20:34
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
20:42
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
20:48
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
20:52
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
20:53
🔗
|
|
BlueMax has joined #archiveteam-bs |
|
21:16
🔗
|
betamax |
right, I just found a flaw in the regex I was using to extract URLs from the campaign sites I downloaded: |
|
21:16
🔗
|
betamax |
it doesn't deal with 'src="' which is what YouTube, etc. use for embedded content |
|
21:17
🔗
|
betamax |
whatever I do will undoubtedly be wrong, does anyone have a good grep command for extracting urls from HTML? |
|
21:17
🔗
|
betamax |
don't need to be inline URLs, just ones that are links or embedded content |
|
21:18
🔗
|
betamax |
preferably one that uses grep rather than egrep, for speed |
|
21:22
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
21:22
🔗
|
schbirid |
i like doing poop like: grep -Po 'src=".*?"' index.html |
|
21:23
🔗
|
schbirid |
or sed 's#src="#\nsrc=#g' index.html | grep src= |
|
21:23
🔗
|
schbirid |
:D |
|
21:23
🔗
|
schbirid |
first one aint poop actually, it's the best i know |
|
21:25
🔗
|
|
BlueMax has quit IRC (Ping timeout: 260 seconds) |
|
21:25
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
21:26
🔗
|
betamax |
coming from a regex idiot, is there any way to alter the first one so it doesn't have the 'src="' and '"' in it? |
|
21:27
🔗
|
jodizzle |
betamax: At that point I would just look around for a command line utility that does processing similar to what jq does for json |
|
21:28
🔗
|
jodizzle |
or look up some sed or awk command |
|
21:28
🔗
|
jodizzle |
If it's slow, you can always try and use parallel to speed things up |
|
21:30
🔗
|
betamax |
grep is low-memory, though (only have ~2GB to work with) |
|
21:30
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
21:30
🔗
|
schbirid |
betamax: grep -Po 'src="\K.*?"' index.html |
|
21:30
🔗
|
schbirid |
via https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match |
|
21:31
🔗
|
schbirid |
oops, trailing " |
|
21:31
🔗
|
schbirid |
grep -Po 'src="\K.*?(?=")' index.html |
|
21:31
🔗
|
schbirid |
wicked cool |
|
21:31
🔗
|
schbirid |
i always just used a sed pipe for that before. thanks |
|
21:32
🔗
|
betamax |
that's it! thanks |
|
21:33
🔗
|
schbirid |
np! |
|
21:38
🔗
|
betamax |
wait, that one doesn't work for single-quotes |
|
21:38
🔗
|
betamax |
is that an issue do you think? |
|
21:41
🔗
|
JAA |
grep -Po 'src=(["'"'"'])\K.*?(?=\1)' index.html |
|
21:57
🔗
|
|
w0rmybak has quit IRC (Quit: Ping timeout (120 seconds)) |
|
21:57
🔗
|
|
kiskabak has quit IRC (Quit: Ping timeout (120 seconds)) |
|
21:57
🔗
|
|
Flashback has quit IRC (Quit: Ping timeout (120 seconds)) |
|
21:58
🔗
|
|
w0rmybak has joined #archiveteam-bs |
|
22:00
🔗
|
|
kiskabak has joined #archiveteam-bs |
|
22:00
🔗
|
|
w0rmybak has quit IRC (Client Quit) |
|
22:00
🔗
|
jodizzle |
JAA: How does that one work? Are the quotes escaping each other? |
|
22:00
🔗
|
|
w0rmybak has joined #archiveteam-bs |
|
22:01
🔗
|
JAA |
jodizzle: It's actually three separate strings: 'src=(["' + "'" + '])...' |
|
22:01
🔗
|
JAA |
That's because you can't have a single quote inside a single-quoted string. There is no backslash escaping or similar. |
|
22:02
🔗
|
JAA |
This is on the shell level, i.e. the argument actually sent to grep is just src=(["'])\K... |
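The quoting explanation above can be checked without running grep at all. This is a small sketch that uses Python's `shlex` module as a stand-in for POSIX shell word splitting (an assumption: `shlex` follows POSIX quoting rules closely enough for this command, but it is not the shell itself).

```python
import shlex

# The command exactly as typed in the channel. A raw string keeps the
# backslashes in \K and \1 untouched on the Python side.
command = r"""grep -Po 'src=(["'"'"'])\K.*?(?=\1)' index.html"""

# Split it the way a POSIX shell would: adjacent quoted pieces
# 'src=(["' + "'" + '])\K.*?(?=\1)' are joined into one argument.
argv = shlex.split(command)
pattern = argv[2]

print(argv[0], argv[1])  # the program name and its options
print(pattern)           # the single regex argument grep receives
```

The printed pattern is `src=(["'])\K.*?(?=\1)`, which matches JAA's description: the three quoted strings collapse into one argument, and the only single quote that survives is the one that was wrapped in double quotes.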
|
22:03
🔗
|
|
Flashback has joined #archiveteam-bs |
|
22:06
🔗
|
jodizzle |
JAA: Ahh took me a second but I get it now. |
|
22:06
🔗
|
jodizzle |
That's pretty cool, thanks. |
|
22:06
🔗
|
JAA |
:-) |
|
22:10
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
|
22:11
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
22:11
🔗
|
|
bakJAA_ is now known as bakJAA |
|
22:15
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
22:30
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
22:33
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
22:40
🔗
|
betamax |
JAA: regex n00b again. Any way to add that regex to one that works with 'href' links as well? |
|
22:41
🔗
|
betamax |
combining it with the one I was using before won't work: in my cluelessness it consisted of about 5 grep commands all chained together |
|
22:45
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
|
22:48
🔗
|
jodizzle |
betamax: Don't quote me, but I think if you did this it would work: grep -Po '(href|src)=(["'"'"'])\K.*?(?=\2)' |
|
22:49
🔗
|
betamax |
thanks, will try |
|
22:51
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
22:51
🔗
|
jodizzle |
Seems to work on a random test html file for me. |
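For comparison, here is a sketch of the same idea in Python. Python's `re` module has no `\K`, so this version captures the value in a group instead. Making the attribute group non-capturing means the quote character becomes group 1, hence `\1` rather than the `\2` in the grep pattern. The name `extract_urls` and the sample HTML are made up for illustration.

```python
import re

# (?:href|src)  attribute name, non-capturing
# (["'])        opening quote, captured as group 1
# (.*?)         attribute value, as short as possible, captured as group 2
# \1            the same quote character that opened the value
ATTR_URL = re.compile(r"""(?:href|src)=(["'])(.*?)\1""")


def extract_urls(html):
    """Return every href/src attribute value found in the given HTML text."""
    return [match.group(2) for match in ATTR_URL.finditer(html)]


# Hypothetical sample mixing double-quoted and single-quoted attributes.
sample = '''<a href="https://example.com/a">A</a>
<iframe src='https://www.youtube.com/embed/xyz'></iframe>
<img src="/logo.png">'''

urls = extract_urls(sample)
print(urls)
```

As with the grep one-liner, this is pattern matching rather than HTML parsing: it misses unquoted attribute values and also picks up attributes whose names merely end in `src`, such as `data-src`.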
|
22:55
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
|
22:56
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
22:56
🔗
|
|
Mateon1 has joined #archiveteam-bs |
|
23:18
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
|
23:19
🔗
|
|
adinbied has quit IRC (Left Channel.) |
|
23:39
🔗
|
|
adinbied has joined #archiveteam-bs |